ABSTRACT 

The Wiener index of a graph is the sum of all pairwise 
shortest-path distances between its vertices. In this paper 
we study the novel problem of finding a minimum Wiener 
connector: given a connected graph $G = (V, E)$ and a set 
$Q \subseteq V$ of query vertices, find a subgraph of G that connects 
all query vertices and has minimum Wiener index. 

We show that Min Wiener Connector admits a 
polynomial-time (albeit impractical) exact algorithm for the 
special case where the number of query vertices is bounded. 
We show that in general the problem is NP-hard, and has no 
PTAS unless P = NP. Our main contribution is a constant-factor 
approximation algorithm running in time $O(|Q||E|)$. 

A thorough experimentation on a large variety of real- 
world graphs confirms that our method returns smaller and 
denser solutions than other methods, and does so by adding 
to the query set Q a small number of “important” vertices 
(i.e., vertices with high centrality). 

1. INTRODUCTION 

Suppose we have identified a set of subjects in a terrorist 
network suspected of organizing an attack. Which other 
subjects, likely to be involved, should we keep under control? 
Similarly, given a set of patients infected with a viral 
disease, which other people should we monitor? Given a set 
of proteins of interest, which other proteins participate in 
pathways with them? 

Each of these questions can be modeled as a graph-query 
problem: given a graph $G = (V, E)$ and a set of query vertices 
$Q \subseteq V$, find a subgraph H of G which “explains” the 
connections existing among the nodes in Q; that is, 
H must be connected and contain all query vertices 
in Q. We call this query-dependent subgraph a connector. 

While there exist many methods for query-dependent subgraph 
extraction (discussed later in Section 1.1), the bulk of 
this literature aims at finding a “community” around the 
set of query vertices Q: the implicit assumption is that the 
vertices in Q belong to the same community, and a good 
solution will contain other vertices belonging to the same 
community of Q. When such an assumption is satisfied, these 
methods return reasonable subgraphs. But when the query 
vertices belong to different modules of the input graph, these 
methods tend to return too large a subgraph, often so large 
as to be meaningless and unusable in real applications. 

The goal of this paper is different, as we do not aim at 
reconstructing a community. Instead we seek a small connector: 
a connected subgraph of the input graph which contains 
Q and a small set of important additional vertices. These 
additional vertices could explain the relation among the 
vertices in Q, or could participate in some function by acting 
as important links among the vertices in Q. We achieve this 
by defining a new, parameter-free problem where, although 
the size of the solution connector is left unconstrained, the 
objective function itself takes care of keeping it small. 

Specifically, given a graph $G = (V, E)$ and a set of query 
vertices $Q \subseteq V$, our problem asks for the connector $H^*$ 
minimizing the sum of shortest-path distances among all pairs 
of vertices (i.e., the Wiener index) in the solution $H^*$: 

$$H^* = \operatorname*{arg\,min}_{G[S]\,:\,Q \subseteq S \subseteq V} \; \sum_{\{u,v\} \subseteq S} d_{G[S]}(u,v)$$

where $G[S]$ denotes the subgraph induced by a set of 
nodes S, and $d_{G[S]}(u,v)$ denotes the shortest-path distance 
between nodes u and v in $G[S]$. We call $H^*$ the minimum 
Wiener connector for query Q. 

This is a very natural problem to study: shortest paths 
define fundamental structural properties of graphs, playing 
a role in all the basic mechanisms of networks such as their 
evolution and the formation of communities. The 
fraction of shortest paths that a vertex takes part in is called 
its betweenness centrality, and is a well-established measure 
of the importance of a vertex, i.e., the extent to which 
an actor has control over information flow. As our experiments 
in Section 6 show, a consequence of our definition 
of minimum Wiener connector is that our solutions tend 
to include vertices which hold an important position in the 
network, i.e., vertices with high betweenness centrality. 

Consider social and biological networks with their modular 
structure [26] (i.e., the existence of communities of vertices 
densely connected inside, and sparsely connected with 
the outside). When the query vertices Q belong to the same 
community, the additional nodes added to Q to form the 
minimum Wiener connector will tend to belong to the same 
community. In particular, these will typically be vertices 
with higher “centrality” than those in Q: these are likely to 
be influential vertices playing leadership roles in the community. 
These might be good users for spreading information, 
or to target for a viral marketing campaign [34]. 




Figure 1: Example of two minimum Wiener connectors on 
Zachary’s “karate club” social network: on the left the 
query vertices Q (in dark gray) belong to different communities, 
on the right they belong to the same community. 

Instead, when the query vertices in Q belong to different 
communities, the additional vertices added to Q to form the 
minimum Wiener connector will include vertices incident 
to edges that “bridge” the different communities. These also 
have strategic importance: information has to go over these 
bridges to propagate from a community to others, thus the 
vertices incident to bridges enjoy a strategically favorable 
position because they can block information, or access it 
before other individuals in their community. These vertices 
are said to span a “structural hole”: they are the best 
candidates to target for blocking the spread of rumors or 
viral diseases in a social network, or the spread of malware 
in a network of computers. In a protein-protein interaction 
network these vertices can represent proteins that play a key 
role in linking modules and whose removal can have different 
phenotypic effects. 

As an example, consider the classic Zachary’s “karate 
club” toy social network with known community structure: 
a dispute between the club president (vertex 34) and 
the instructor (vertex 1) led to the club splitting into two. 
In Figure 1 we show two different minimum Wiener connectors: 
the one on the left has the query nodes Q (in dark 
gray) belonging to the two different communities, while in 
the example on the right, all the query vertices belong to 
the same community. As discussed above, we can observe 
that when the query vertices span different communities, 
the minimum Wiener connector will include vertices 
incident to bridging edges. This is the case in our example in 
Figure 1 (left): given Q = {12, 25, 26, 30} the solution subgraph 
$H^*$ adds to Q the vertices 1 and 34 (the leaders of the 
two communities) and the vertex 32, which is one of the few 
vertices connecting 1 and 34 (which do not have a direct 
connection) and thus practically bridges the two communities. 
By contrast, in the example on the right, the query vertices 
Q = {4, 12, 17} belong to the same community, and as 
expected the solution remains inside the community: in this 
case we just add two vertices, one of which is the community 
leader (vertex 1), which holds a very central position. 

1.1 Related work 

At a high level our problem can be described as the problem 
of finding an interesting connected subgraph of G 
containing a set of query vertices Q. Several problems of this 
type have been studied under different names, depending on 
the objective function adopted: local community detection, 
seed set expansion, connectivity subgraphs, just to mention 
a few. As discussed above, most of the existing approaches 
aim at finding a community around the seeds in Q: these 
methods end up producing very large solutions, especially 
when the query vertices are not in the same community. Our 
goal instead is to produce a small connector by adding a few 
central vertices. Another important distinction is that many 
researchers have considered only the cases where |Q| = 1 
[14, 15] or |Q| = 2. Finally, our method is parameter-free, 
while several existing methods have many parameters 
which make a direct comparison complicated. In the following 
we provide a brief overview of this body of literature, 
highlighting the distinctions w.r.t. our proposal. 


Random-walk methods. Many authors have adopted 
random-walk-based approaches to the problem of finding 
vertices related to a seed set of vertices: this is the basic idea 
of Personalized PageRank [29, 33]. Spielman and Teng propose 
methods that start with a seed and sort all other vertices 
by their degree-normalized PageRank with respect to 
the seed. Andersen and Lang and Andersen et al. 
build on these methods to formulate an algorithm for detecting 
overlapping communities in networks. In a recent work, 
Kloumann and Kleinberg [37] provide a systematic evaluation 
of different methods for seed set expansion on graphs 
with known community structure. They assume that the 
seed set Q is made of vertices belonging to the same community 
C: under this assumption they measure precision and 
recall in reconstructing C. Their main findings are that (i) 
PageRank-based methods outperform other methods, (ii) 
few iterations (two or three) of the PageRank update rule 
are sufficient for convergence, and (iii) standard PageRank 
is to be preferred over degree-normalized PageRank. 

Closer to our goals, Faloutsos et al. address the problem 
of finding a subgraph that connects two query vertices 
(|Q| = 2) and contains at most b other vertices, optimizing a 
measure of proximity based on electrical-current flows. Tong 
and Faloutsos [53] extend the work of Faloutsos et al. to deal 
with query sets of any size, but again having a budget b of 
additional vertices. They introduce the concept of Center-piece 
Subgraph, the computation of which is based on the Hadamard 
(i.e., component-wise) product of a set of vectors, where 
each vector is obtained by doing a random walk with restart 
from a query vertex. The efficiency and scalability of the 
method are severely limited by the processing time of random 
walks with restart. Koren et al. [38] redefine proximity using 
the notion of cycle-free effective conductance and propose a 
branch-and-bound algorithm. 

All the approaches described above require several parameters: 
common to all is the size of the required solution, 
plus all the usual parameters of PageRank methods, e.g., 
the jumpback probability, or the number of iterations. We 
recall that instead our problem definition and algorithms are 
completely parameter-free. 


Other methods. Asur and Parthasarathy introduce the 
concept of viewpoint neighborhood analysis in order to identify 
neighbors of interest to a particular source in a dynamically 
evolving network. The authors also show a connection 
of their measure with heat diffusion. However, the method 
of Asur and Parthasarathy has several parameters, such as 
the budget, the stopping threshold, and the minimum number 
of viewpoint neighborhoods for a vertex. 

More recently, Sozio and Gionis [48] provide a parameter-free 
combinatorial optimization formulation. Their problem 
asks to find a connected subgraph containing Q and maximizing 
the minimum degree. Sozio and Gionis show that 
the problem is solvable in polynomial time and propose an 
efficient algorithm. However, their algorithm tends to return 
extremely large solutions (it should be noted that for 
the same query Q many different optimal solutions of different 
sizes exist). To circumvent this drawback they 
also study a constrained version of their problem, with an 
upper bound on the size of the output community. In this 
case the problem becomes NP-hard. The authors propose 
a heuristic where the quality of the solution produced (i.e., 
its minimum degree) can be arbitrarily far away from the 
optimal value of a solution to the unconstrained problem. 

Cui et al. [15] propose a local-search method to improve 
the efficiency of the algorithm by Sozio and Gionis [48]; however, 
their method does not solve the issue of the size of the 
solutions produced. Moreover, their method works only for 
the special case |Q| = 1. 


Steiner Tree and MAD Spanning Trees. Given a graph 
and a set of terminal vertices, the Steiner tree problem asks 
to find a minimum-cost tree that connects all terminals. 
This is an extremely well-studied problem: a plethora of 
methods to solve or approximate it, and many variants of the 
problem, have been defined [22]. We will explain in detail 
(Section 2) how our Min Wiener Connector problem differs 
from the Steiner tree problem. 

Another related problem is Minimum Average Distance 
(MAD) Spanning Trees: given a graph G, find a spanning 
tree of G that minimizes the average shortest-path distance 
among all pairs of vertices. This problem is related 
to Wiener index as a MAD Spanning Tree is a spanning tree 
that minimizes the Wiener index. However, this problem 
still remains different from our Min Wiener Connector 
as the latter asks for subgraphs containing a given set of 
query vertices rather than asking to span the whole input 
graph. In a sense, our problem is to MAD Spanning Trees 
as Steiner Tree is to Minimum Spanning Tree. 


Wiener index. The notion of Wiener index is rooted in 
chemistry, where in 1947 Harry Wiener introduced it to characterize 
the topology of chemical compounds. In general, 
the Wiener index captures how well connected a set of 
vertices is, thus bearing resemblance to centrality measures 
and finding application in several fields, such as communication 
theory, facility location, and cryptography [18]. A 
recent work also considers the Wiener index in the context 
of event detection in activity networks [46]. Existing literature 
on Wiener index focuses on computing it efficiently, 
finding a tree that minimizes/maximizes it among all trees 
with a prescribed degree sequence [56, 61], characterizing 
the trees which minimize or maximize [21, 50] the 
Wiener index among all trees of a given size and maximum 
degree, or solving the inverse Wiener index problem [20]. 

To the best of our knowledge, the problem of finding a 
minimum-Wiener-index subgraph containing a given set of 
query vertices has never been studied before. 

In our experiments in Section 6 we will compare our 
method with prior contributions which allow |Q| > 2: following 
the findings of [37] we will use a standard PageRank 
(with no normalization) personalized over the query vertices 
Q (PPR for short), the so-called Center-piece Subgraph 
(CPS for short) which is closer in spirit to our goal of finding 
a connector and not a community, the so-called Cocktail 
Party Problem (CTP for short) which is parameter-free, and the 
classic Steiner Tree (ST for short). 


1.2 Contributions and roadmap 

In this paper we initiate the study of the Min Wiener 
Connector problem, a novel parameter-free graph query 
problem, whose objective function favors small connected 
subgraphs, obtained by adding few central vertices to the 
query vertices. Beyond this main contribution, we provide 
a series of theoretical and empirical results: 


• We show that, when the number of query vertices is 
small, Min Wiener Connector can be solved exactly 
in polynomial time (§3). However, in the general case 
our problem is NP-hard and it has no PTAS unless 
P = NP (§2). Note that, while the inapproximability 
result says that the problem cannot be approximated 
within every constant, it leaves open the possibility of 
approximating it within some constant. 

• In fact, our central result is an efficient constant-factor 
approximation algorithm for Min Wiener Connector 
(§4), which runs in $O(|Q||E|)$ time. 

• We devise integer-programming formulations of our 
problem. We use them to compare our solutions 
for small graphs with those found using state-of-the-art 
solvers, and show empirically that our solutions are 
indeed close to optimal (§6.2). 

• We empirically confirm that existing methods for 
query-dependent community extraction tend to produce large 
solutions, which become even larger when the query 
set Q is made of vertices belonging to different communities 
(§6.4). Our method instead produces solution 
subgraphs which are smaller in size, denser, and which 
include more central nodes (§6.3), regardless of whether 
the query vertices belong to the same community or not. 

• We show interesting case studies in biological and social 
networks, confirming that our method returns small 
solutions that include important vertices. 


2. PROBLEM STATEMENT 

Preliminaries. We consider simple, connected, undirected, 
unweighted graphs. We denote the vertex set (resp., edge 
set) of a graph G by $V(G)$ (resp., $E(G)$). 

Given a graph G and $S \subseteq V(G)$, we denote by $G[S]$ the 
subgraph of G induced by S: $G[S] = (S, E|_S)$, where 
$E|_S = \{(u,v) \in E \mid u \in S, v \in S\}$. For any connected graph H and 
$u, v \in V(H)$, let $d_H(u,v)$ denote the shortest-path distance 
between u and v in H. Clearly, if H is a subgraph of G, it 
holds that $d_G(u,v) \le d_H(u,v)$. 

The Wiener index $W(H)$ of a (sub)graph H is the sum 
of pairwise distances between vertices in H: 

$$W(H) = \sum_{\{u,v\} \subseteq V(H)} d_H(u,v), \qquad (1)$$

where the sum is taken over unordered pairs. 

For ease of notation, we identify any $S \subseteq V(G)$ with its 
induced subgraph $G[S]$. Thus, we use the shorthand $d_S(u,v)$ 
(resp., $W(S)$) to denote $d_{G[S]}(u,v)$ (resp., $W(G[S])$). 

The input to our problem is a connected graph G and a 
set $Q \subseteq V(G)$ of query vertices (or terminals). A connector 
for Q in G is a connected subgraph of G containing Q. 
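The definition in Eq. (1) can be sketched in a few lines of Python. This is only an illustration under our own conventions: the graph is an adjacency dictionary, and the helper names `bfs_distances` and `wiener_index` are hypothetical, not taken from the paper. Distances are measured inside the induced subgraph $G[S]$, and a disconnected $G[S]$ is reported as infinite cost.

```python
from collections import deque


def bfs_distances(adj, source, allowed):
    """Hop distances from `source`, walking only through vertices in `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj, vertices):
    """W(G[S]) for S = vertices; float('inf') if G[S] is disconnected."""
    s = set(vertices)
    total = 0
    for u in s:
        dist = bfs_distances(adj, u, s)
        if len(dist) < len(s):
            return float("inf")  # some vertex of S is unreachable inside G[S]
        total += sum(dist.values())
    return total // 2  # every unordered pair was counted twice


# A path on 10 vertices: 0 - 1 - ... - 9.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
print(wiener_index(path, range(10)))  # 165
```

One BFS per vertex gives $O(|S| \cdot |E|)$ time for an unweighted graph, which is adequate for checking small examples by hand.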

Problem definition. In this work we aim at finding subgraphs 
of the input graph that connect a given set of query 
vertices while minimizing the Wiener index. 

Problem 1 (Min Wiener Connector). Given a 
graph $G = (V, E)$ and a query set $Q \subseteq V$, find a connector 
$H^*$ for Q in G with the smallest Wiener index. 

Clearly we may restrict the search to vertex sets and their 
corresponding induced subgraphs, since keeping every edge of G 
among the chosen vertices can only shorten distances. 

Note that $W(H)$ may be written as the product of $\binom{|V(H)|}{2}$ 
and the average distance between pairs of distinct vertices 
of H. Therefore, Problem 1 encourages solutions that attain 
a proper tradeoff between having small pairwise distances 
and using few vertices. In fact, while adding vertices may 
decrease distances, it also increases the number of terms to 
be summed up in Eq. (1). 

Hardness results. We next prove that the problem is NP-hard, 
hence unlikely to admit efficient exact solutions. In 
fact we show a stronger result, namely that it does not admit 
a polynomial-time approximation scheme: unless P = NP, 
one cannot obtain a polynomial-time c-factor approximation 
algorithm for every c > 1. This inapproximability result 
says that the problem cannot be approximated within every 
constant, but leaves open the possibility of approximating it 
within some constant; we will in fact show that this is possible. 

Our proof makes use of the following inapproximability 
result for Vertex Cover in Bounded Degree Graphs: 

Theorem 1 (Dinur and Safra). There exist 
constants $d \in \mathbb{N}$, $\alpha \in \mathbb{R}^+$ with the following property: given 
a degree-$d$ graph G and an integer $k \in \mathbb{N}$, it is NP-hard to 
distinguish instances where the minimum vertex cover of 
G has size larger than $k(1 + \alpha)$ from instances where the 
minimum vertex cover of G has size at most $k$. 

Theorem 2. There is some constant $\varepsilon \in \mathbb{R}^+$ such that 
Problem 1 is NP-hard to approximate within a $1 + \varepsilon$ factor. 

Proof. We present a gap-preserving reduction from 
Vertex Cover in Bounded Degree Graphs to Min 
Wiener Connector. Let $\alpha$ and $d$ be as in Theorem 1. 
Let $(G, k)$ be an instance of the decision version of Vertex 
Cover in Bounded Degree Graphs, where the degree of 
G is at most $d$. We need to show that there is some constant 
$\varepsilon > 0$ such that, given $(G, k)$, we can construct in polynomial 
time an instance $(G', Q)$ of Min Wiener Connector 
and a bound $B \in \mathbb{N}$ with the following properties: 

(a) if G has a vertex cover of size at most $k$, then $(G', Q)$ 
has a connector with Wiener index at most $B$; 

(b) if every vertex cover of G has size larger than $k(1 + \alpha)$, 
then every connector of $(G', Q)$ has Wiener index larger 
than $B(1 + \varepsilon)$. 

Let G have $n$ vertices and $m$ edges. We may assume that 
$m \ge k \ge m/d$, for otherwise the vertex cover instance can be 
answered trivially. Furthermore, for any fixed constant $c$ we 
may assume that $k > c \cdot d$ (if not, we can solve the vertex 
cover instance in polynomial time). 

Our graph $G'$ is built as follows. Let the vertex set of $G'$ 
be composed of: 

• a distinguished “root” node $r$; 

• $n$ vertices $v_1, \ldots, v_n$ corresponding to vertices of G; 

• $m$ vertices $e_1, \ldots, e_m$ corresponding to edges of G. 

Put an edge between $r$ and every vertex in $\{v_1, \ldots, v_n\}$, 
and connect $v_i$ to $e_j$ if and only if $v_i$ is an endpoint of $e_j$ 
in G. Note that to every $v_i$ correspond at most $d$ such 
$e_j$'s, and the degree of each $e_j$ is exactly two. Finally, let 
$Q = \{e_1, \ldots, e_m\} \cup \{r\}$ be the set of query vertices. 

Figure 2: Example showing that a solution to Steiner 
Tree may exhibit a large Wiener index. Note that similar 
situations arise also in real-world instances (see §6.4). 

Observe that any solution to the original Vertex Cover 
instance gives rise to a feasible solution to Min Wiener 
Connector that contains Q and a subset $X \subseteq \{v_1, \ldots, v_n\}$; 
conversely, any solution to Min Wiener Connector is of 
the form $Q \cup X$, where $X \subseteq \{v_1, \ldots, v_n\}$ is a vertex subset 
whose corresponding vertices in G cover all the edges of G. 

We claim that, if $|X| = t$ and $Q \cup X$ forms a connected 
subgraph, then the Wiener index of the induced subgraph 
of $G'$ containing Q and X is at most $u(t) = t^2 + 3mt + 2m^2$ 
and at least $l(t) = u(t) - O(md)$. Indeed, the contribution to 
the Wiener index from $r$ to the $t$ chosen vertices is exactly $t$, 
and its contribution to the $m$ edges is exactly $2m$. The sum 
of induced distances among the $t$ chosen vertices is precisely 
$\binom{t}{2} \cdot 2 = t^2 - t$. The contribution of the $t$ chosen vertices to 
$e_1, \ldots, e_m$ is $t(3m - O(d))$, because the distance from $v_i$ to 
$e_j$ is exactly 3 except in the case that $e_j$ has an edge to $v_i$ 
(meaning that $v_i$ was an endpoint of $e_j$ in the original graph 
G, and there are at most $d$ such edges for any $v_i$). Similarly, 
the contribution from $e_1, \ldots, e_m$ to themselves is 
$\frac{1}{2} m(4m - O(d)) = 2m^2 - O(md)$. The total is $t^2 + 2m^2 + 3tm - O(md)$. 

If we pick $B = u(k)$, we know that condition (a) is satisfied. 
Regarding condition (b), a straightforward computation 
shows that $u(k) \le 6d^2k^2$ and $u(k(1 + \alpha)) \ge u(k) + 5\alpha k^2$, 
hence using the fact that $m \le kd$ we get 

$$\frac{l((1 + \alpha)k) - u(k)}{u(k)} \;\ge\; \frac{5\alpha k^2 - O(md)}{6d^2k^2} \;\ge\; \frac{5\alpha}{6d^2} - \frac{O(1)}{k}.$$

For some large enough $k_0 = \Omega(d^2/\alpha)$ and all $k \ge k_0$, the 
right-hand side is at least some constant $\varepsilon > 0$ that depends 
only on $\alpha$ and $d$. The above then shows that 

$$\frac{l((1 + \alpha)k) - u(k)}{u(k)} \;\ge\; \varepsilon,$$

so condition (b) holds as well, completing the proof. □ 
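The construction used in the reduction is easy to experiment with. The sketch below builds $G'$ and $Q$ from a small Vertex Cover instance and evaluates the Wiener index of $Q \cup X$ for a candidate set $X$, comparing it with the upper bound $u(t) = t^2 + 3mt + 2m^2$ stated in the proof. The vertex labels (`"r"`, `("v", i)`, `("e", j)`) and all function names are our own illustrative choices.

```python
from collections import deque


def build_reduction(n, edges):
    """Return (adjacency of G', query set Q) for a graph G on vertices 0..n-1."""
    adj = {"r": set()}
    for i in range(n):
        adj[("v", i)] = {"r"}          # root is adjacent to every v_i
        adj["r"].add(("v", i))
    for j, (a, b) in enumerate(edges):
        adj[("e", j)] = {("v", a), ("v", b)}  # e_j is adjacent to its endpoints
        adj[("v", a)].add(("e", j))
        adj[("v", b)].add(("e", j))
    query = {"r"} | {("e", j) for j in range(len(edges))}
    return adj, query


def wiener_index(adj, vertices):
    """Wiener index of the induced subgraph; inf if it is disconnected."""
    s = set(vertices)
    total = 0
    for src in s:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in s and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(s):
            return float("inf")
        total += sum(dist.values())
    return total // 2


def upper_bound(t, m):
    """u(t) = t^2 + 3mt + 2m^2 from the proof of Theorem 2."""
    return t * t + 3 * m * t + 2 * m * m


# G is a 4-cycle; {0, 2} is a vertex cover, {0} is not.
cycle_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
adj, query = build_reduction(4, cycle_edges)
cover = {("v", 0), ("v", 2)}
print(wiener_index(adj, query | cover), upper_bound(2, 4))
print(wiener_index(adj, query | {("v", 0)}))  # not a cover: disconnected
```

On this instance the connector built from the cover has Wiener index 48, below $u(2) = 60$, while a set $X$ that is not a vertex cover leaves some $e_j$ isolated, so $Q \cup X$ is not a connector at all.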


Min Wiener Connector vs. Steiner tree. At first 
glance, Problem 1 may resemble the well-known Steiner 
Tree problem: given a graph G and a set Q 
of terminal vertices of G, find a minimum-sized connector 
for Q. (The size is measured as the total cost of edges used, 
which for unweighted graphs is one less than the number of 
vertices used.) Such an optimal subgraph must be a tree. 

Although related (see Section 4), the two problems are 
different. In fact, a solution for Steiner Tree may be 
arbitrarily bad for Min Wiener Connector. To see this, consider 
the graph in Figure 2. Let $Q = \{u_1, \ldots, u_{10}\}$ 
be the set of query/terminal vertices. The unique optimal 
solution to the Steiner Tree problem is obviously 
Q itself, which has Wiener index $W(Q) = 165$. However, 
adding either vertex $r_1$ or $r_2$ would lower the index 
to $W(Q \cup \{r_1\}) = W(Q \cup \{r_2\}) = 151$, and the optimal 
solution to our Min Wiener Connector problem is given 
by $Q \cup \{r_1, r_2\}$, which has Wiener index 142. Also note that 
no tree is an optimal solution to this example, showing that 
the addition of extra edges may help decrease the cost. 

In general, the fact that the Steiner Tree problem seeks 
connectors with as few edges/vertices as possible hinders the 
minimization of pairwise distances. We can generalize the 
example in Figure 2 to a line of length $h$ and a root $r$ connected 
to all of its vertices. The optimal Steiner Tree solution 
exhibits a Wiener index of $\Theta(h^3)$, determined by the $\Theta(h^2)$ 
pairs of vertices and an average distance of $\Theta(h)$; on the 
other hand, a solution to Min Wiener Connector can include 
$r$ so as to achieve constant average distance, lowering 
the Wiener index to $O(h^2)$. 
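The gap between the two objectives on the line-plus-root family can be checked numerically. The sketch below assumes the line consists of $h$ query vertices $0, \ldots, h-1$ and that the extra root `"r"` is adjacent to each of them; the function names are ours and purely illustrative.

```python
from collections import deque


def wiener_index(adj):
    """Sum of hop distances over unordered vertex pairs of a connected graph."""
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2


def line(h):
    """Path on vertices 0..h-1: the Steiner Tree solution when Q is the line."""
    return {i: {j for j in (i - 1, i + 1) if 0 <= j < h} for i in range(h)}


def line_with_root(h):
    """The same path plus a root adjacent to every path vertex."""
    adj = line(h)
    adj["r"] = set(range(h))
    for i in range(h):
        adj[i].add("r")
    return adj


for h in (10, 20, 40):
    print(h, wiener_index(line(h)), wiener_index(line_with_root(h)))
```

For this family the two values have closed forms, $h(h^2 - 1)/6$ for the line alone and $h^2 - h + 1$ once the root is included, which matches the $\Theta(h^3)$ versus $O(h^2)$ behaviour described above (165 versus 91 for $h = 10$).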

3. AN EXACT ALGORITHM 

Here we address the question of how to solve the Min 
Wiener Connector problem exactly. If the input graph 
has $n$ vertices, a straightforward solution would be to try 
all $2^n$ vertex subsets and compute their Wiener index; this 
gives a running time of $2^n \cdot \mathrm{poly}(n)$. On the other hand, there 
are some polynomial-time solvable special cases. As an example, 
for unweighted graphs (like the ones studied in this 
work), when |Q| = 2, any shortest path between the two 
terminals yields an optimal solution. As many problems like 
satisfiability, coloring, etc., turn from easy to NP-hard as 
the “size” parameter switches from 2 to 3, it is natural to 
wonder if the same happens here. Interestingly, the answer 
is negative: the problem admits an exact algorithm that 
runs in polynomial time for any fixed bound on the maximum 
size of the query set. The result is of limited practical 
interest, but gives insight into the nature of the problem. 
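The straightforward exponential-time approach mentioned above can be written down directly. The following sketch enumerates every vertex set containing Q and keeps the one with the smallest Wiener index; it is only meant for tiny graphs, and the names `wiener_index` and `min_wiener_connector_bruteforce` are hypothetical helpers, not the paper's algorithm from Theorem 3.

```python
from collections import deque
from itertools import combinations


def wiener_index(adj, vertices):
    """Wiener index of the induced subgraph; inf if it is disconnected."""
    s = set(vertices)
    total = 0
    for src in s:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in s and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(s):
            return float("inf")
        total += sum(dist.values())
    return total // 2


def min_wiener_connector_bruteforce(adj, query):
    """Try every superset of `query`; exponential in the number of other vertices."""
    q = set(query)
    others = [v for v in adj if v not in q]
    best_set, best_cost = None, float("inf")
    for k in range(len(others) + 1):
        for extra in combinations(others, k):
            candidate = q | set(extra)
            cost = wiener_index(adj, candidate)
            if cost < best_cost:
                best_set, best_cost = candidate, cost
    return best_set, best_cost


# Path 0-1-2-3-4-5 plus a hub "r" adjacent to every path vertex.
graph = {i: {j for j in (i - 1, i + 1) if 0 <= j < 6} | {"r"} for i in range(6)}
graph["r"] = set(range(6))

print(min_wiener_connector_bruteforce(graph, {0, 5}))
```

With |Q| = 2 the optimum found here is the shortest path 0, r, 5 with Wiener index 4, in line with the remark that any shortest path between two terminals is optimal.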


Theorem 3. The Min Wiener Connector problem 
can be solved in polynomial time when $|Q| = O(1)$. 

That is, there is a function $f : \mathbb{N} \to \mathbb{N}$ such that the Min 
Wiener Connector problem can be solved in time polynomial 
in the size of G, with the degree of the polynomial depending 
only on $f(|Q|)$. 

The proof is in Appendix A.1. The intuition is that an optimal 
solution has $f(|Q|) = \mathrm{poly}(|Q|)$ “pivotal” vertices that 
are useful in connecting several query vertices together or are 
query vertices themselves. Any other vertices in the solution 
are simply “pass-through” vertices needed to connect pairs 
of pivotal vertices via shortest paths; they could be replaced 
by vertices in another arbitrary shortest path between the 
required pivotal vertices. Thus, if we try all possible sets of 
pivotal vertices and connection patterns among them, and 
then find shortest paths in G to actually connect them, we 
are guaranteed to find an optimal solution. 

4. AN APPROXIMATION ALGORITHM 

As we cannot hope for efficient exact solutions to Min 
Wiener Connector, in this section we design an efficient 
algorithm with provable approximation guarantees. Specifically, 
we achieve a constant-factor approximation in roughly 
the same time it takes to compute shortest-path distances 
from the terminals to every other vertex in the graph. 

Proof outline. We need to introduce a series of relaxations 
of Min Wiener Connector to arrive at a problem 
for which it is easier to derive an approximation algorithm. 
First we show that we can approximate the cost of 
any solution in terms of the number of vertices in it and 
the single-source shortest-path distances to a suitably chosen 
root vertex $r \in V(G)$. Then we introduce a further 
relaxation where distances are measured according to the 
original graph; using the techniques developed by [35] to 
find light approximate shortest path trees, we show how to 
make use of a solution to this relaxation. Then we apply 
a linearization technique to show that if we knew a certain 
parameter $\lambda$ controlling the ratio between the size of the 
optimal solution and the sum of distances to $r$ in the optimal 
solution, we could reduce our problem to Node-weighted 
Steiner Tree. It turns out that our particular instances 
of the latter problem admit an $O(1)$-approximation (unlike 
the general case). Next, we explain how to search quickly 
for the correct values of $r$ and $\lambda$; as an optimization, we also 
prove that we can further restrict the search of candidates 
for $r$. Finally, we combine these arguments to prove the 
correctness and efficiency of our algorithm. 

Step 1: from Min Wiener Connector to Min-A Connector. 
First we need the following lemma, whose proof 
may be found in Appendix A.2. 

Lemma 1. For any graph H, 

$$\min_{r \in V(H)} \sum_{v \in V(H)} d_H(v, r) \;\le\; \frac{2\,W(H)}{|V(H)|} \;\le\; 2 \min_{r \in V(H)} \sum_{v \in V(H)} d_H(v, r).$$

Lemma 1 justifies the introduction of the following problem. 
Given a subgraph H of G and $r \in V(H)$, let 

$$A(H, r) = |V(H)| \cdot \sum_{v \in V(H)} d_H(v, r),$$

$$A(H) = \min_{r \in V(H)} A(H, r).$$


Problem 2 (Min-A Connector). Given a graph G 
and a query set $Q \subseteq V(G)$, find a connector H for Q in 
G minimizing $A(H)$. 

Note that standard Steiner tree problems do not minimize 
$A(H)$, but the number (or total cost) of the edges in H. 

Corollary 1. Any $\alpha$-approximate solution to Problem 2 
is a $2\alpha$-approximate solution to Min Wiener Connector. 
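Lemma 1 and the resulting sandwich $A(H) \le 2W(H) \le 2A(H)$ behind Corollary 1 can be checked numerically on small graphs. The sketch below does so under our reading of the lemma, namely $\min_r \sum_v d_H(v,r) \le 2W(H)/|V(H)| \le 2\min_r \sum_v d_H(v,r)$; the function name `lemma1_terms` is an illustrative choice, and the input graph is assumed to be connected.

```python
from collections import deque


def distance_sums(adj):
    """For each vertex r of a connected graph, the sum of hop distances to r."""
    sums = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        sums[src] = sum(dist.values())
    return sums


def lemma1_terms(adj):
    """Return (min_r sum_v d(v, r), W(H), A(H)) for the graph H given by `adj`."""
    sums = distance_sums(adj)
    wiener = sum(sums.values()) // 2      # each unordered pair counted twice
    best_root_sum = min(sums.values())
    a_value = len(adj) * best_root_sum    # A(H) = |V(H)| * min_r sum_v d(v, r)
    return best_root_sum, wiener, a_value


path5 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 5] for i in range(5)}
star5 = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}

for name, graph in (("path", path5), ("star", star5)):
    root_sum, wiener, a_value = lemma1_terms(graph)
    print(name, root_sum, wiener, a_value, a_value <= 2 * wiener <= 2 * a_value)
```

The left inequality holds because the minimum over roots is at most the average over roots, and the right one follows from the triangle inequality through the best root.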

Step 2: from Min-A Connector to Min Weak-A 
Rooted Connector via distance adjustments. One approach 
to solve Problem 2 is to “guess” the correct vertex $r$ 
and then find a connector H for Q that minimizes $A(H, r)$. 
However, the objective function depends on the induced 
distances of the unknown solution. In order to simplify our 
task, we now introduce a “weak” relaxation of the above 
problem where shortest-path distances are measured in the 
input graph G instead. 

Given a subgraph H of G and a vertex $r \in V(H)$, define 

$$\tilde{A}(H, r) = |V(H)| \cdot \sum_{u \in V(H)} d_G(u, r). \qquad (2)$$

Problem 3 (Min Weak-A Rooted Connector). 
Given a graph G, a root $r \in V(G)$ and a query set $Q \subseteq V(G)$, 
find a Steiner tree T for Q in G minimizing $\tilde{A}(T, r)$. 

Here we insist that the solution be a tree (unlike in Problem 2, 
where we allowed non-tree solutions, even though an 
optimal solution may easily be seen to be a tree as well). The 
reason will become apparent shortly. 
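The difference between the original objective and its “weak” relaxation is only in where distances are measured. The sketch below computes both quantities on a small example, assuming the definitions $A(H,r) = |V(H)| \sum_v d_H(v,r)$ and the weak variant of Eq. (2) with $d_G$ in place of $d_H$; the names `strong_a` and `weak_a` are ours.

```python
from collections import deque


def bfs(adj, source, allowed):
    """Hop distances from `source`, restricted to the vertex set `allowed`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def strong_a(adj, h, r):
    """A(H, r): distances measured inside the induced subgraph G[H]."""
    h = set(h)
    dist = bfs(adj, r, h)
    return len(h) * sum(dist[v] for v in h)


def weak_a(adj, h, r):
    """Weak variant of Eq. (2): distances measured in the whole graph G."""
    h = set(h)
    dist = bfs(adj, r, set(adj))
    return len(h) * sum(dist[v] for v in h)


# G is a 6-cycle; H drops vertex 5, so inside H the vertices 3 and 4
# are farther from the root 0 than they are in G.
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
h = [0, 1, 2, 3, 4]
print(strong_a(cycle, h, 0), weak_a(cycle, h, 0))
```

Since $d_G \le d_H$, the weak value never exceeds the original one (here 40 versus 50), which is why a separate step is needed to turn a good weak solution into a good solution for Problem 2.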



We are now faced with an additional complication, namely 
that a good solution to Min Weak-A Rooted Connector 
may not give a good solution to Min-A Connector. 
Hence the need to perform a post-processing step on every 
candidate solution to ensure that distances in the modified 
solution resemble distances in G as closely as possible. 

Lemma 2. Let T be a subtree of G and r £ V(T). There 
is another subtree T' of G with the following properties: 


We next show that, by choosing A in the proper way, any 
approximate solution to Problem yields an approximate 
solution to Problem too. The right choice of A is given by 
the following lemma, proved in Appendix ) A. 4| 

Lemma 3. For any graph G with |P(G)| > 2, query set 
Q C ViG) and r G V{G), there is \ £ [1/^2, ^\vIg)\] 
such tha^or any a £ every a-approximate solution to 
Problem^is also an -approximate solution to Pro6Zem[^ 


(a) V{T') 3 V{T); 

(b) \V{T')\ < (l + ^)|P(r)|; 

(e) for all v £ V{T'), dT>{r,v) < (1 -I- \/2) dG{r,v). 

(d) E„ev(T') dG{r,v) < Y.v&v[T)dG{r,v). 

Furthermore, given T, a BFS tree from r in G, and dG (r, v) 
for all V £ V(G), it is possible to construct T' in time 
0{\V{T)\). 

This follows from a slight modihcation of an algorithm by 
Khuller et al for balancing spanning trees and shortest-path 
trees [35[ Lemma 3.2]; although they state it for minimum 
spanning trees and shortest path trees with the same vertex 
set, a careful examination of their proof establishes Lemma[^ 
as well. For completeness, we reproduce the proof in Ap¬ 
pendix |A.3| 

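Corollary 2. Any $\alpha$-approximation to Min Weak-A Rooted
Connector can be used to obtain a $(4+3\sqrt{2})\,\alpha$-approximation
to Min-A Connector.

Proof. We can try all possible choices of $r$. For each of
them, let $T$ be an $\alpha$-approximation to Min Weak-A Rooted
Connector. Then we can find a tree $T'$ as in Lemma 2. Since
$V(T) \subseteq V(T')$, $T'$ is also a connector for $Q$. Writing
$\tilde{A}$ for the weak objective (with distances measured in $G$),
it satisfies

$$
\begin{aligned}
\tilde{A}(T', r) &= |V(T')| \sum_{v \in V(T')} d_G(r,v) \\
&\le (1+\sqrt{2})\,|V(T)| \sum_{v \in V(T')} d_G(r,v) \\
&\le (1+\sqrt{2})\sqrt{2}\,|V(T)| \sum_{v \in V(T)} d_G(r,v) \\
&= (2+\sqrt{2})\,\tilde{A}(T, r),
\end{aligned}
$$

and $A(T', r) \le (1+\sqrt{2})\,\tilde{A}(T', r) \le (4+3\sqrt{2})\,\tilde{A}(T, r)$. □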


Step 3: from Min Weak-A Rooted Connector to
Min-B Rooted Steiner Tree. We further relax Min Weak-A
Rooted Connector so as to employ a modified objective
function where the product between the number of vertices
in $H$ and the sum of original distances to the chosen root $r$
is replaced with a linear combination of the two. The
rationale here is to make the overall objective function
linear and, as such, more amenable to standard
approximation techniques.

Given (a subgraph induced by) a subset of vertices
$H \subseteq V(G)$, a root vertex $r \in V(H)$, and a parameter
$\lambda \in \mathbb{R}^+$, the modified objective we consider is:

$$
B(H, r, \lambda) = \lambda\,|H| + \frac{1}{\lambda} \sum_{v \in H} d_G(r, v). \tag{3}
$$

Problem 4 (Min-B Rooted Steiner Tree). Given
a graph $G$, a query set $Q \subseteq V(G)$, a root vertex $r \in V(G)$, and
a parameter $\lambda \in \mathbb{R}^+$, find a Steiner tree $H$ for
$Q \cup \{r\}$ in $G$ minimizing $B(H, r, \lambda)$.


Step 4: approximating Min-B Rooted Steiner Tree.
Our next step aims to find approximate solutions to
Problem 4. To this end, we note that Problem 4 can be cast as a
Node-Weighted Steiner Tree problem, where the cost of
a node $u$ is equal to $\lambda + d_G(r, u)/\lambda$. However, no
approximation factor better than $\Omega(\log |Q|)$ is possible in
general for Node-Weighted Steiner Tree, unless every
problem in NP can be solved in quasipolynomial time.
Nevertheless, we show that our particular problem admits a
constant-factor approximation, by shifting the cost from
vertices to edges and reducing it to a classical Steiner Tree
problem. The reason is that in our instance the distances from
the root $r$ of two adjacent vertices cannot differ by more
than 1, thus the overall solution cost is nearly preserved
despite the cost shift. We formalize this intuition next.


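Lemma 4. Given a graph $G = (V, E)$, a query set $Q \subseteq V$,
a root vertex $r \in V$, and a parameter $\lambda \in \mathbb{R}^+$, let
$G_{r,\lambda}$ be a weighted graph with vertex set $V$, edge set $E$,
and weight on each edge $(u,v)$ equal to
$w(u,v) = \lambda + \max\{d_G(r,u), d_G(r,v)\}/\lambda$.
Then any Steiner tree $T$ for $Q \cup \{r\}$ satisfies the following:

$$
B(T, r, \lambda) - \lambda \;\le \sum_{(u,v) \in E(T)} w(u,v) \;\le\; 2\,\bigl(B(T, r, \lambda) - \lambda\bigr).
$$

Proof. Observe that the cost $w(u,v)$ of each edge
$(u,v) \in E(T)$ lies in the range
$[\lambda + d_G(r,u)/\lambda,\ \lambda + (d_G(r,u)+1)/\lambda]$, as $u$ and $v$
are adjacent in $T$. Notice that in every edge $(u,v)$ of $T$,
either $u$ is the parent of $v$ or $v$ is the parent of $u$.
Hence, writing $A' = V(T) \setminus \{r\}$, we can bound

$$
\underbrace{\sum_{u \in A'} \Bigl(\lambda + \frac{d_G(r,u)}{\lambda}\Bigr)}_{B(T,r,\lambda) - \lambda}
\;\le \sum_{(u,v) \in E(T)} w(u,v) \;\le\;
\sum_{u \in A'} \Bigl(\lambda + \frac{d_G(r,u)+1}{\lambda}\Bigr).
$$

The result follows by noticing that the right-hand side is at
most

$$
B(T,r,\lambda) - \lambda + \frac{|A'|}{\lambda} \;\le\; 2\,\bigl(B(T,r,\lambda) - \lambda\bigr),
$$

where the last inequality holds because every $u \in A'$ has
$d_G(r,u) \ge 1$.

□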
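The inequality of Lemma 4 can be checked numerically on a small instance. The following is a minimal sketch, assuming an unweighted graph stored as an adjacency dictionary; the helper names (`bfs_distances`, `edge_weight`, `objective_b`) and the toy 4-cycle are illustrative choices, not part of the original implementation.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances d_G(source, .) in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def edge_weight(dist, lam, u, v):
    """Edge cost in G_{r,lambda}: lambda + max(d(r,u), d(r,v)) / lambda."""
    return lam + max(dist[u], dist[v]) / lam


def objective_b(vertices, dist, lam):
    """B(H, r, lambda) = lambda * |H| + (sum of d_G(r, v) over H) / lambda."""
    return lam * len(vertices) + sum(dist[v] for v in vertices) / lam


# Hypothetical toy instance: the 4-cycle 0-1-3-2-0, rooted at r = 0.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
root, lam = 0, 1.5
dist = bfs_distances(adj, root)

# A (non-BFS) Steiner tree for Q ∪ {r} with Q = {2}: the path 0-1-3-2.
tree_edges = [(0, 1), (1, 3), (3, 2)]
tree_vertices = {0, 1, 2, 3}

edge_cost = sum(edge_weight(dist, lam, u, v) for u, v in tree_edges)
node_cost = objective_b(tree_vertices, dist, lam) - lam
```

On this instance the edge cost lies strictly between `node_cost` and `2 * node_cost`, because the tree edge (3, 2) leads to a vertex that is closer to the root in G than its tree parent.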

Lemma 4 entails a reduction from Problem 4 to the
well-studied Steiner Tree problem. The best known
algorithm for the latter is the 1.39-factor approximation
algorithm of Byrka et al. However, it is based on solving
a linear program, in contrast to quicker combinatorial
algorithms that achieve a factor-2 approximation. The fastest
among the latter is due to Mehlhorn [41].

Corollary 3. A 4-approximation to Problem 4 can be
computed in time $O(|E| + |V| \log |V|)$, provided that
shortest-path distances from $Q$ in $G$ have been precomputed.

Proof. We can construct the graph of Lemma 4 in time
$O(|V| + |E|)$, and use the 2-approximation algorithm of
Mehlhorn [41] for Steiner tree, which runs in time
$O(|E| + |V| \log |V|)$. By Lemma 4, the result is a
4-approximation for Problem 4. □










Step 5: choosing $r$ and $\lambda$. At this point, we know that,
with the right choice of $\lambda$ (which depends on the problem
instance), we can get a constant-factor approximate solution
to the original problem. For any given graph $G$ and query set
$Q \subseteq V(G)$, the algorithm would run as follows:

• For every vertex $r \in V$ do:

    • Compute $d_G(r, u)$ from $r$ to every other vertex $u$;

    • Guess $\lambda$ matching the value stated in Lemma 3;

    • Construct the weighted graph $G_{r,\lambda}$ of Lemma 4;

    • Find an $\alpha$-approximate solution $S^*$ to the Steiner
      Tree problem on graph $G_{r,\lambda}$ and terminals $Q \cup \{r\}$;

• Take the $S^*$ that minimizes $A(S^*, r)$.

However, we still need to explain how to guess $\lambda$. Since
there are only $\mathrm{poly}(|V(G)|)$ many possible values for
$\lambda^2$, we could try all of them in polynomial time. A faster
way is to fix some $\beta > 0$ and then try all powers of
$(1+\beta)$ in the interval $[\sqrt{1/2}, \sqrt{|V|}]$, of which
there are only $O(\log |V| / \beta)$ many; this guarantees that
one of the candidate values of $\lambda$ tried will be off by a
factor of at most $1+\beta$. It is not hard to generalize Lemma 3
to show that using a $(1+\beta)$-approximation for the true value
of $\lambda$ results in the loss of another multiplicative
$(1+\beta)^2$ factor in the overall approximation.
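The geometric search over $\lambda$ can be sketched as follows. This is a minimal illustration under one assumption: the grid is anchored at the lower endpoint $\sqrt{1/2}$ rather than at exact integer powers of $(1+\beta)$, which preserves the property that every value in the interval is within a factor $1+\beta$ of some candidate. The function name `lambda_candidates` is hypothetical.

```python
import math


def lambda_candidates(num_vertices, beta=1.0):
    """Geometric grid of guesses for lambda in [sqrt(1/2), sqrt(|V|)].

    Consecutive candidates differ by a factor (1 + beta), so any value in
    the interval lies between some candidate c and c * (1 + beta).
    """
    low = math.sqrt(0.5)
    high = math.sqrt(num_vertices)
    candidates = []
    lam = low
    while lam <= high:
        candidates.append(lam)
        lam *= 1.0 + beta
    return candidates


# Example: a graph with 100 vertices and beta = 1.
grid = lambda_candidates(100, beta=1.0)
```

The grid has $O(\log |V| / \log(1+\beta))$ entries, so a smaller $\beta$ trades more Steiner-tree computations for a tighter guess.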

Step 6: restricting the number of root vertices.
Finally, we show that trying all possible root vertices $r \in V$
is overkill if we are willing to settle for a somewhat larger
approximation factor. The next result shows that we can
restrict our search to elements of the query set (notice that
an optimal tree solution has all its leaves in $Q$).

Lemma 5. Let $T$ be a tree, $r \in V(T)$, and let $x^*$
be a leaf of $T$ closest to $r$. Then
$\sum_{u \in V(T)} d_T(x^*, u) \le 3 \sum_{u \in V(T)} d_T(r, u)$,
hence $A(T, x^*) \le 3 \cdot A(T, r)$.

Proof. For any vertex $x \in V(T)$, let
$d(x) = \sum_{u \in V(T)} d_T(u, x)$. It suffices to show that
$d(x^*) - d(r) \le 2\,d(r)$. To this end, partition $V(T)$ into
levels according to the distance to $r$:
$L_i = \{u \in V(T) \mid d_T(r,u) = i\}$.
Let $\ell = d_T(r, x^*)$ and for $t \in \mathbb{N}$ write
$L_{\le t} = \bigcup_{j \le t} L_j$ and $L_{>t} = \bigcup_{j > t} L_j$.
On the one hand,

$$
\begin{aligned}
d(x^*) - d(r) &= \sum_{u \in V(T)} \bigl(d_T(u, x^*) - d_T(u, r)\bigr) \\
&\le \sum_{u \in V(T)} \bigl|d_T(u, x^*) - d_T(u, r)\bigr|
\;\le\; \bigl(|L_{\le \ell}| + |L_{>\ell}|\bigr)\,\ell.
\end{aligned}
\tag{4}
$$

On the other hand, observe that by our choice of $x^*$, it is
guaranteed that $|L_0| \le |L_1| \le \dots \le |L_\ell|$, as every
vertex at level $i < \ell$ has at least one child (and they are
distinct as $T$ is acyclic). This implies that we can partition
$L_{\le \ell}$ into a collection of pairs $\{a, b\}$ where $a \ne b$
and $d_T(r,a) + d_T(r,b) \ge \ell$, possibly along with a
singleton element from $L_{\ge \ell/2}$. Therefore, the average
distance from the elements of $L_{\le \ell}$ to $r$ is at least
$\ell/2$. Furthermore, every element of $L_{>\ell}$ is at distance
greater than $\ell$ from $r$ by definition. Hence

$$
d(r) \;\ge\; |L_{\le \ell}|\,\frac{\ell}{2} + |L_{>\ell}|\,(\ell + 1). \tag{5}
$$

Combining Equations (4) and (5) yields the result. □
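As a sanity check, the bound of Lemma 5 can be tested empirically on random trees. The sketch below assumes trees given as adjacency dictionaries and takes "leaf" to mean a degree-one vertex other than the root; the generator `random_tree` (each new vertex attaches to a uniformly chosen earlier vertex) is an illustrative choice and not a uniform sampler over labelled trees.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Distances d_T(source, .) in a tree given as an adjacency dict."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def random_tree(n, rng):
    """Attach vertex v to a random earlier vertex, for v = 1..n-1."""
    adj = {v: [] for v in range(n)}
    for v in range(1, n):
        u = rng.randrange(v)
        adj[u].append(v)
        adj[v].append(u)
    return adj


def closest_leaf(adj, root):
    """A degree-one vertex (other than the root) nearest to the root."""
    dist = bfs_distances(adj, root)
    leaves = [v for v in adj if v != root and len(adj[v]) == 1]
    return min(leaves, key=lambda v: dist[v])


def lemma5_holds(adj, root):
    """Check sum_u d_T(x*, u) <= 3 * sum_u d_T(r, u)."""
    x_star = closest_leaf(adj, root)
    d_root = sum(bfs_distances(adj, root).values())
    d_leaf = sum(bfs_distances(adj, x_star).values())
    return d_leaf <= 3 * d_root


rng = random.Random(0)
checks = [lemma5_holds(random_tree(n, rng), rng.randrange(n))
          for n in range(2, 60) for _ in range(20)]
```

Every entry of `checks` should be `True`, since the lemma holds for all trees; the experiment only illustrates the statement and is not a substitute for the proof.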

Putting it all together. The pseudocode for our approach
is shown as Algorithm 1. The following theorem summarizes
the results about solution quality and running time.


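Algorithm 1 WienerSteiner

Input: A graph $G = (V, E)$; a set of query vertices $Q \subseteq V$.
Output: A set of vertices $H^*$ with $Q \subseteq H^* \subseteq V$.

 1: For all $q \in Q$ and for all $u \in V$, compute $d_G(q, u)$
 2: $\mathcal{H} \leftarrow \emptyset$                      ▷ set of candidate solutions
 3: $\beta \leftarrow$ any constant $> 0$               ▷ e.g., $\beta = 1$
 4: for $t = 1, \dots, \lceil \log_{1+\beta} |V| \rceil$ do
 5:     $\lambda \leftarrow (1+\beta)^t$                ▷ guess the right balance
 6:     for $r \in Q$ do                  ▷ guess a "root" vertex
            ▷ Compute $G_{r,\lambda} = (V, E, w)$ (Lemma 4)
 7:         for $(u, v) \in E$ do
 8:             $w(u, v) \leftarrow \lambda + \max\{d_G(r,u), d_G(r,v)\}/\lambda$
 9:         end for
10:         $T \leftarrow$ ApproxSteinerTree($G_{r,\lambda}$, $Q$)
11:         $H \leftarrow$ AdjustDistances($T$)    ▷ see Lemma 2
12:         $\mathcal{H} \leftarrow \mathcal{H} \cup \{(H, r)\}$
13:     end for
14: end for
15: $H^* \leftarrow \arg\min_{(H,r) \in \mathcal{H}} A(H, r)$      ▷ see Remark 1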


Theorem 4. There is a constant-factor approximation
algorithm for the Min Wiener Connector problem running
in $O\bigl(|Q|\,(|E| \log |V| + |V| \log^2 |V|)\bigr)$ time.

Proof. First we prove correctness. Let $H$ denote an
optimal solution to Min Wiener Connector. As shown
earlier, $A(H) \le 2 \cdot W(H)$, so there is $r \in V(H)$ with
$A(H, r) \le 2 \cdot W(H)$. By Lemma 5 there exists $q \in Q$ with
$A(H, q) \le 3 \cdot A(H, r) \le 6 \cdot W(H)$; henceforth we take $q$
to be the "root" vertex in our problems. Let $K$ denote an
optimal solution to Min Weak-A Rooted Connector with
root $q$; clearly $\tilde{A}(K, q) \le \tilde{A}(H, q) \le A(H, q)$, where
$\tilde{A}$ is the weak objective (distances measured in $G$). Let
$\lambda$ be as in Lemma 3 and let $L \subseteq V(G)$ be an optimal
solution to Problem 4. Then for any connector $X$, the
conclusion of Lemma 3 says that if
$B(X, q, \lambda) \le \alpha\, B(L, q, \lambda)$, then
$\tilde{A}(X, q) \le \alpha^2 \tilde{A}(K, q)$.

The main loop is guaranteed to try at some point this
choice of $q$ and also a $(1+\beta)$-approximation $\lambda'$ for
$\lambda$. It is readily seen that, for any $Y$,
$B(Y, q, \lambda') \le (1+\beta)\, B(Y, q, \lambda)$.
By Corollary 3 we can find a 4-factor approximation to
Problem 4 with $q$ and $\lambda'$; in particular we find $X$, $q$ and
$\lambda'$ with
$B(X, q, \lambda') \le 4\,B(L, q, \lambda') \le 4(1+\beta)\,B(L, q, \lambda)$.
Therefore
$\tilde{A}(X, q) \le 16(1+\beta)^2 \tilde{A}(K, q) \le 96(1+\beta)^2\, W(H)$.

By Corollary 2, line 11 obtains a graph $X'$ with

$$
A(X', q) \le (4+3\sqrt{2})\,\tilde{A}(X, q) \le 96(1+\beta)^2 (4+3\sqrt{2})\, W(H).
$$

Therefore, another application of the relation between $W$
and $A$ tells us that

$$
W(X') \le A(X', q) \le 792(1+\beta)^2\, W(H) = O(W(H)),
$$

as we wished to prove.

As for the running time, computing the initial shortest-path
distances (Line 1) takes $O(|Q|\,(|V| + |E|))$ time, while
the main loop in Lines 3-14 is repeated $O(|Q| \log |V|)$ times.
Lines 6-10 compute the weighted graph $G_{r,\lambda}$ and find an
approximate Steiner tree, thereby solving Problem 4. By
Corollary 3, they run in time $O(|E| + |V| \log |V|)$. Line 11
adjusts large distances and runs in linear time (Lemma 2).
Finally, computing $A(H, r)$ in Line 15 can be done in
linear time for each element of $\mathcal{H}$ (of which there are
$O(|Q| \log |V|)$). In summary, the overall runtime of
Algorithm 1 is $O\bigl(|Q|\,(|E| \log |V| + |V| \log^2 |V|)\bigr)$. □

Remark 1. The last line of Algorithm 1 is intended to
return the best solution found. It may be replaced with
$H^* \leftarrow \arg\min_{(H,r) \in \mathcal{H}} W(H)$, which can only lead
to better solutions. The trouble is that computing $W(H)$
exactly may be very costly for large $H$; this poses no
difficulty in practice as the sets found are typically small.
However, for the worst-case analysis of the running time
bounds, it is important to use $A(H, r)$ as a proxy for the
actual Wiener index $W(H)$.
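The trade-off in Remark 1 can be made concrete with a short sketch. It assumes that $A(H, r) = |V(H)| \sum_{v} d_H(r, v)$ with distances taken in the subgraph induced by $H$ (consistent with the way $A$ is used in Lemma 5), and that $H$ induces a connected subgraph; the function names are hypothetical. Computing $W(H)$ needs one BFS per vertex of $H$, whereas the proxy needs a single BFS.

```python
from collections import deque


def induced_distances(adj, vertices, source):
    """BFS distances from source inside the subgraph induced by `vertices`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in vertices and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj, vertices):
    """Exact W(H): sum of pairwise distances, one BFS per vertex of H."""
    total = sum(sum(induced_distances(adj, vertices, s).values())
                for s in vertices)
    return total // 2  # each unordered pair was counted twice


def proxy_a(adj, vertices, root):
    """Proxy A(H, r) = |V(H)| * sum_v d_H(r, v), using a single BFS."""
    dist = induced_distances(adj, vertices, root)
    return len(vertices) * sum(dist.values())


# Toy graph: a 5-cycle 0-1-2-3-4-0; H is the path 0-1-2-3 (vertex 4 excluded).
adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [0, 3]}
H = {0, 1, 2, 3}
w = wiener_index(adj, H)
best_proxy = min(proxy_a(adj, H, r) for r in H)
```

On this toy input the exact index is 10 and the best proxy value is 16, so the proxy overestimates $W(H)$ by less than a factor of two here.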

5. LOWER BOUNDS 

In this section we design methods to prove lower bounds
on the optimal Wiener index. The idea is to compare the
Wiener index of the solution output by our method with the
optimum. As the optimal solution is unknown, we compare
against a lower bound on its cost. While this is a pessimistic
approach, proving that our solutions are close to the lower
bound allows us to state with certainty that they are close
to optimal as well.

To compute the desired lower bound, we give an
integer-programming formulation of the Min Wiener Connector
problem. Let $S$ denote the vertices in a feasible solution,
i.e., a connector of $Q$ in $G$. We set a variable $y_u$ to 1 for
each $u \in S$ (in particular $y_u = 1$ for all $u \in Q$), and
introduce another variable $p_{st}$ for each pair
$(s,t) \in V(G) \times V(G)$. In the intended solution,
$p_{st} = 1$ iff $y_s = y_t = 1$; we model this by the linear
constraint $p_{st} \ge y_s + y_t - 1$. Notice that the connectivity
requirement is equivalent to being able to route a unit of
flow from $s$ to $t$ whenever $p_{st} = 1$. We add two variables
$f^{st}_{uv}$ and $f^{st}_{vu}$ for each edge $\{u, v\}$ in $G$ and each
pair $s, t \in V$, which will be set to one when a fixed shortest
path from $s$ to $t$ traverses the edge between $u$ and $v$ in the
corresponding direction. For each $s, t$ and
$v \in V \setminus \{s, t\}$, the flow constraints indicate that the
net flow through $v$ is zero:
$\sum_{u \in N(v)} [f^{st}_{uv} - f^{st}_{vu}] = 0$, where $N(v)$ are
the neighbours of $v$ in $G$. Also, the net flow through $s$ must
be $-p_{st}$ and for $t$ must be $p_{st}$. Since
$d_S(s,t) = \sum_{u,v} f^{st}_{uv}$ and the latter sum vanishes when
$p_{st} = 0$, $W(S) = \frac{1}{2} \sum_{u,v,s,t} f^{st}_{uv}$.
The complete program is shown next.




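$$
\begin{aligned}
\min \quad & \tfrac{1}{2} \sum_{u,v,s,t} f^{st}_{uv} \\
\text{s.t.} \quad & \sum_{u \in N(v)} \bigl[f^{st}_{uv} - f^{st}_{vu}\bigr] =
  \begin{cases}
    -p_{st} & \text{if } v = s \\
    p_{st} & \text{if } v = t \\
    0 & \text{otherwise}
  \end{cases}
  && \forall s, t, v \in V \\
& f^{st}_{uv} \le y_u && \forall \{u, v\} \in E \\
& p_{st} \ge y_s + y_t - 1 && \forall s, t \in V \\
& y_u = 1 && \forall u \in Q \\
& f^{st}_{uv},\ p_{st} \ge 0 \\
& y_u \in \{0, 1\}
\end{aligned}
\tag{6}
$$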


Theorem 5. Program (6) models the Min Wiener
Connector problem.

The proof is reported in Appendix A.5.

Program (6) uses more than $2|E||V|^2$ variables and more
than $|V|^3$ constraints, which can be problematic for large
graphs. A way to reduce the size of the program is to ask for
minimization of the pairwise sum of distances in the original
graph: this is a safe relaxation as our solutions typically
respect the original distances. Applying this relaxation, the
objective function becomes a linear function of $p_{st}$, thus
eliminating the need for separate flow variables for each $s, t$
pair and leading to a program significantly smaller in size.

Let $y_u$ and $p_{st}$ be as before. The Wiener index of any
solution is at least $\sum_{u,v} d_G(u,v) \cdot p_{uv}$. To express the
condition that the variables with $y_u = 1$ form a connected
subgraph, we add two variables $x_{uv}$ and $x_{vu}$ for each edge
$(u, v)$ of $G$. Pick an arbitrary $q \in Q$ and any directed
spanning tree $T_q$ of $S$ rooted at $q$; the intended solution
will have $x_{uv} = 1$ if and only if $u$ is the parent of $v$ in
$T_q$. One constraint is that $x_{uv} + x_{vu} \le y_u$: edge
$\{u, v\}$ can be used only in one direction, and in order to
use it from $u$ to $v$, we must choose $u$ as well. Also, for any
$v \ne q$, $\sum_{u \in N(v)} x_{uv} = y_v$ (each chosen vertex must
have exactly one parent in $T_q$). Finally, we need to make sure
that the edges with $x_{uv} + x_{vu} = 1$ form an undirected tree.
A tree with $k$ vertices has $k - 1$ edges and no cycles; hence
we enforce the constraint
$\sum_{\{u,v\} \in E} [x_{uv} + x_{vu}] = \sum_u y_u - 1$ and, in order
to avoid cycles, we add constraints saying that the sum of
$x_{uv} + x_{vu}$ over all edges $(u, v)$ in every cycle $C$ of $G$ is
at most $|C| - 1$.


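$$
\begin{aligned}
\min \quad & \sum_{s,t} d_G(s,t) \cdot p_{st} \\
\text{s.t.} \quad & \sum_{u \in N(v)} x_{uv} = y_v && \forall v \in V \setminus \{q\} \\
& \sum_{\{u,v\} \in E} \bigl[x_{uv} + x_{vu}\bigr] = \sum_{u} y_u - 1 \\
& \sum_{(u,v) \in C} \bigl[x_{uv} + x_{vu}\bigr] \le |C| - 1 && \forall \text{ cycle } C \\
& x_{uv} + x_{vu} \le y_u && \forall \{u, v\} \in E \\
& p_{st} \ge y_s + y_t - 1 && \forall s, t \in V \\
& y_u = 1 && \forall u \in Q \\
& x_{uv},\ p_{st} \ge 0 \\
& y_u \in \{0, 1\}
\end{aligned}
\tag{7}
$$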


We reduced the number of variables to a more manageable
$O(|V|^2)$, in exchange for exponentially many constraints (one
per cycle in $G$). This is not a serious issue because the
program above has a separation oracle [27], and commercial
solvers support the addition of lazy constraints [28].
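For integral candidate solutions, separating the cycle constraints amounts to finding a cycle among the selected edges. The following is a minimal sketch under that assumption: `chosen_edges` stands for the undirected edges with $x_{uv} + x_{vu} = 1$ in the candidate, and the function returns the vertices of one cycle (or `None`), from which a constraint of the form "sum over the cycle at most $|C| - 1$" could be added lazily. The solver-side callback is deliberately not shown, and fractional solutions would need a different oracle.

```python
def find_cycle(chosen_edges):
    """Return the vertex sequence of one cycle among the edges, or None.

    chosen_edges: iterable of distinct unordered pairs (u, v), u != v.
    """
    adj = {}
    for u, v in chosen_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    parent = {}
    for start in adj:
        if start in parent:
            continue  # component already explored
        parent[start] = None
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v == parent[u]:
                    continue  # the tree edge we arrived through
                if v in parent:
                    return _close_cycle(parent, u, v)  # non-tree edge
                parent[v] = u
                stack.append(v)
    return None


def _close_cycle(parent, u, v):
    """Cycle formed by the tree paths from u and v to their common ancestor."""
    ancestors = []
    x = u
    while x is not None:
        ancestors.append(x)
        x = parent[x]
    position = {node: i for i, node in enumerate(ancestors)}
    tail = []
    y = v
    while y not in position:
        tail.append(y)
        y = parent[y]
    # u ... common ancestor, then down to v; the edge (v, u) closes the cycle.
    return ancestors[:position[y] + 1] + tail[::-1]


cycle = find_cycle([(0, 1), (1, 2), (2, 0), (2, 3)])
no_cycle = find_cycle([(0, 1), (1, 2), (2, 3)])
```

When a cycle is returned, consecutive vertices in the list (including the wrap-around from last to first) are joined by selected edges, so the violated constraint can be read off directly.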


6. EXPERIMENTS 

In this section we report the results of our empirical anal¬ 
ysis. Here we anticipate the main findings: 

• Our approximation algorithm produces solutions which
are close to optimal (§6.2).

• When compared to other concepts of query-dependent
subgraph extraction, such as personalized PageRank
[37] (ppr), Center-piece Subgraph (cps), or the
Cocktail Party Subgraph [48] (ctp), the minimum
Wiener connector is several orders of magnitude smaller
in size, much denser, and includes vertices with
higher centrality (§6.3).

• When the query set Q includes vertices belonging to
different communities, ppr, cps, and ctp return solutions
that are 5 to 10 times larger than in the case where the
whole of Q belongs to the same community. The
minimum Wiener connector is only slightly larger (§6.4).

• Steiner tree produces solutions that are much closer to
the minimum Wiener connector than the other methods.
However, in addition to having a larger Wiener
index (§6.5), the Steiner-tree solutions are nearly always
less dense, and include vertices with lower centrality.
Also, interestingly, the size of our solutions is comparable
to the size of Steiner-tree solutions, despite the fact
that Steiner tree explicitly optimizes for solution size.










Table 1: Summary of graphs used. δ: density, ad: average
degree, cc: clustering coefficient, ed: effective diameter.
Datasets with ground-truth communities (*). Classical
Steiner Tree benchmarks with given query workload (‡).

Dataset       |V|         |E|          δ        ad     cc     ed
football      115         613          9.4e-2   21.3   0.40   3.9
jazz          198         2742         1.4e-1   55.4   0.62   3.8
celegans      453         2025         2.0e-2   17.9   0.65   4.0
email         1133        5452         8.5e-3   9.62   0.22   8
yeast         2224        6609         2.6e-3   5.94   0.14   11
oregon        10670       22002        3.8e-4   4.12   0.30   4.4
astro         18772       198110       1.1e-3   22.0   0.63   5
dblp*         317080      1049866      2.1e-5   6.62   0.63   8.2
youtube*      1134890     2987624      4.6e-6   5.27   0.08   6.5
wiki          2394385     5021410      1.8e-6   4.19   0.22   3.9
livejournal   3997962     34681189     4.3e-6   17.3   0.28   6.5
twitter       11316811    85331846     1.3e-6   15.1   0.09   5.9
dbpedia       18268992    172183984    1.0e-6   18.9   0.17   5.0
puc‡          64-4096     448-24574    -        -      -      -
vienna‡       1991-8755   3176-14449   -        -      -      -


6.1 Experimental set up 

Algorithms. We compare our algorithm WS-Q with several 
alternative methods described next. Following the literature 
on random walks with restart [M 2^ 59 , CPS is initial¬ 
ized with a restart parameter c=0.85, number of iterations 
m=100, and a convergence error threshold ^=10“^. To al¬ 
low CPS to converge to the best possible solution, no budget 
constraint is given a priori: we greedily add to the solution 
the highest-score vertex, until we connect the vertices in Q. 
For the personalized PageRank method, ppr, we use the 
same settings as CPS, as well as the same way of selecting 
which and how many vertices to add to the solution. 

For ctp [48] we found that the parameter-free version
typically returns solutions that are too large (often with a
size comparable to the original graph). In order to limit the
size of the solutions returned while keeping the method
parameter-free, we first execute a BFS from each query vertex
until all other vertices in Q are connected; among all these
subgraphs we pick the smallest one, and run over it the greedy
algorithm of [48].
For Steiner tree (st) we use the approximation algorithm
by Mehlhorn [41], which is the same one that ws-q uses
internally to solve the Steiner tree instances it generates.
All algorithms are implemented in C++.

Datasets and query workloads. We use real-world
publicly-available graphs of various types and sizes,
spanning different domains: communication over emails and wiki
pages, citation and co-authorship networks, road networks,
social networks, and web graphs (Table 1).

Small datasets are used for assessing the approximation
quality (§6.2) of our algorithm ws-q w.r.t. the best provable
bounds obtained by solving the integer program of Section 5.

Medium-large datasets are used for characterizing the
solutions produced by the various algorithms described above,
in terms of size, density, and centrality (§6.3).

In all these datasets, the query workloads are made of
random query sets Q, with controlled size and average
distance of the query vertices. Datasets marked with (*)
contain ground-truth community structure [58]: these are used
to create different workloads with query vertices in Q
belonging to the same community or to different communities
(§6.4). As we delve deeper into the comparison between Min
Wiener Connector and Steiner tree (§6.5), we use
benchmarks with predefined query workloads which are used for

Table 2: Comparison of the Wiener index of ws-q's solution
with the lower (G_L) and upper (G_U) bounds found by the
Gurobi solver for different datasets and query set sizes. The
cost of the optimal solution is guaranteed to be in [G_L, G_U].
(+): numbers based on the best lower bound the solver could
prove before it ran out of memory; they give an upper bound
on the error that is likely to be an overestimate.

Dataset    |Q|   ws-q   G_U    G_L     Error interval
football   3     40     40     40      0
           5     172    172    164     [0, 4.9%]
           10    656    598    538     [9.6%, 22%]
           20    2352   2018   1546+   [16.5%, 52.2%+]
jazz       3     16     16     16      0
           5     44     44     44      0
           10    276    276    260     [0, 6.2%]
           20    1014   964    936     [5.1%, 8.4%]
celegans   3     36     36     36      0
           5     106    106    106     0
           10    330    330    326     [0, 1.3%]
           20    1204   1196   1192    [0.66%, 1.1%]
email      3     58     58     58      0
           5     250    250    240     [0, 4.2%]
           10    1352   1208   1033+   [11.9%, 30.9%+]
           20    5490   5490   4032+   [0, 36.2%+]


assessing Steiner tree algorithms.¹ These are marked with
(‡) in Table 1: benchmark puc contains 25 problems on
small graphs with |Q| ∈ [8, 2048], while benchmark vienna
contains 85 problems with |Q| ∈ [50, ~5k].

For scalability assessment (§6.6) we use the larger graphs
in Table 1 plus synthetic graphs generated according to the
Erdős-Rényi and Power-Law models.²

6.2 Approximation quality 

Table 2 reports the Wiener index of the solution produced
by ws-q, and how it compares with the best provable
bounds obtained with the integer-programming formulation
of Section 5 and the state-of-the-art Gurobi solver. This
comparison was carried out on small graphs, as otherwise the
number of variables would be too large to even formulate the
integer program. We initialize the solver with our solution so
that, by construction, the solver's upper bound can never be
worse than ours. A match in the solver's upper and lower
bounds indicates that an optimal solution was found.
When they do not coincide, either there is a gap between
the best solution and the lower bound from the integer
program, or the solver ran out of memory during the
optimization phase (in which case we report the best lower
bound found so far).

We also report an error interval obtained by comparing
our solution with the solver's best upper and lower bounds.
Observe that, for small query sets (three to five vertices),
ws-q produces solutions that are optimal or very close to
optimal (with error in the interval [0, 5%]). The worst
discrepancy between our solution and the solver's best
solution is 16.5% (football with |Q| = 20); here all we can
prove is that our solution is at most 52.2% from optimal.
However, note that in this case there is also a significant gap
between the solver's own lower and upper bounds, thus 52.2%
is likely to be an overestimate. It should also be noted that
this query set size is approximately 1/5 the size of the whole
vertex set V.
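The error intervals can be reproduced, up to rounding, from the three values reported for each row. The sketch below assumes the interval is the relative gap of the ws-q value with respect to the solver's upper and lower bounds; this reading agrees with the reported entries to within a few tenths of a percentage point, but the exact rounding convention is not stated, and the function name is hypothetical.

```python
def error_interval(ws_q, upper, lower):
    """Relative gap of the ws-q value w.r.t. the solver's bounds.

    Returns (gap to the best known solution, gap to the proven lower bound).
    """
    return (ws_q - upper) / upper, (ws_q - lower) / lower


# Values taken from the email row with |Q| = 10.
low_err, high_err = error_interval(1352, 1208, 1033)

# A row where all three values coincide yields a zero-width interval.
exact = error_interval(40, 40, 40)
```

For the email row this gives roughly 11.9% and 30.9%, matching the reported interval.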

¹ http://steinlib.zib.de/

² http://snap.stanford.edu/snap/index.html


















Table 3: Main characteristics of the solution H returned by
different algorithms on 6 datasets, with |Q| = 10 and average
distance of 4 among the vertices in Q. Each experiment
is run 5 times and we report averages of the size of the
solution |V(H)|, the density of the solution δ(H), the
average betweenness centrality bc(H) of vertices in H, and
the Wiener index W(H).

Figure 3: Left column: fixed average distance AD = 4
among query vertices, varying |Q|. Right column: fixed
query set size |Q| = 5, varying average distance among query
vertices. We report |V(H)|, δ(H), and bc(H) on oregon.


6.3 Solution characterization 

Table 3 and Figure 3 report a characterization of the
solutions produced by the various algorithms in terms of number
of vertices in the solution (|V(H)|), density of the solution
(δ(H)) and betweenness centrality of the vertices in the
solution (bc(H)). Table 3 reports results for various graphs


Table 4: Average solution size for query workloads based on
ground-truth communities: dc = query vertices in different
communities, sc = query vertices in the same community,
and dc/sc = the ratio of the previous two columns.

with a fixed size of Q, and a fixed average distance among
the vertices in Q, while Figure 3 shows the same statistics
for a single dataset (oregon) with varying size of Q and
average distance of the query vertices. Results confirm that
ws-q produces solutions which are always smaller, denser,
and contain vertices with higher betweenness centrality than
the other methods. The difference is striking with all the
methods except Steiner tree, which is much closer to the type
of solutions produced by ws-q. As expected, since the other
methods do not try to optimize it, ws-q produces solutions
with a Wiener index that is orders of magnitude smaller.
Moreover, the solutions ws-q provides have a much smaller
Wiener index than the Steiner-tree solutions. A deeper
comparison between ws-q and st is reported in Section 6.5.

6.4 Ground-truth communities workload 

Next, we compare the behavior of the various methods
when the query set Q belongs to a single community or to
multiple communities. To this end, we use graphs with
ground-truth community structure (dblp and youtube) and produce
two query workloads for each graph: one with query vertices
belonging to the same community (denoted sc) and one
with query vertices coming from different communities
(denoted dc). Each workload contains 40 queries, 10 for each
size |Q| ∈ {3, 5, 10, 20}. For sc workloads, we pick the
community at random, but avoid small communities (of size
smaller than 100 vertices).

The results are reported in Table 4. We observe that
when Q belongs to multiple communities, random-walk-based
methods (ppr and cps) produce solutions which are
from 7 to 11 times larger than when Q belongs to only one
community. While the ratio is less striking for ctp (3 to 5
times larger), the solutions produced are already extremely
large in both workloads.

The results confirm that these methods are conceived to 
reconstruct a community around a given seed set of ver¬ 
tices Q, implicitly assumed to belong to the same commu¬ 
nity; thus, they tend to return significantly larger results 
when this does not hold. By contrast, ws-Q and ST do not 
rely on such assumptions, and the difference in average so¬ 
lution size between the two workloads is much smaller. 

6.5 Comparison on Steiner tree benchmarks 

We have shown that the types of solutions produced by
community-oriented methods have very different
characteristics from those produced by the minimum Wiener
connector and the Steiner tree. We next delve deeper into the
comparison between the minimum Wiener connector and the
Steiner tree, using Steiner-tree benchmarks with predefined
query workloads, and focusing on the objective functions of
the two problems: the size of the solution (Steiner) and the
sum of the pairwise shortest-path distances (Wiener).

Figure 6: A minimum Wiener connector extracted from a
PPI network that links genes associated with cancer and
Alzheimer's disease.

Figure 4: CDFs of the ratio of (a) solution size and (b)
Wiener index on benchmarks vienna and puc.

Figure 4 reports the cumulative distribution functions
(CDFs) of the ratio of solution size (left) and Wiener
index (right) between st and ws-q, on the two benchmarks
puc and vienna. As expected, ws-q produces solutions which
have a much smaller sum of the pairwise shortest-path
distances. It is also interesting to observe that our algorithm
ws-q often outperforms the well-established Steiner-tree
approximation algorithm by Mehlhorn [41] with respect
to the size of the solution, which is the objective function of
the Steiner tree (recall that the Min Wiener Connector
objective function implicitly favors small solutions).

Interestingly, we also observe that many problem
instances on the vienna benchmark are real-world instances
of the situation depicted in an earlier figure: that is to say,
ws-q produces a slightly larger solution in number of
vertices, yet with a significantly smaller Wiener index.

6.6 Scalability 

We now focus on ws-q's runtime performance and
scalability with an increasing graph or query set size. We use
Erdős-Rényi (ER) and Power-Law (PL) models to generate
synthetic graphs with varying size, while keeping other
graph properties constant. We also use the larger real-world
graphs in Table 1. Results are reported in Figure 5. First,
we note that the performance of the algorithm is not
significantly affected by the type of graph (random or power-law).
Second, runtime has an almost linear relationship with the
query set size, as well as the input graph size. However, as
expected from Theorem 4, runtime is impacted more by the
graph size than by the query-set size.

Parallelization. Examining Algorithm 1, we notice that
we can easily speed up our method via parallelization (e.g.,
via a Map-Reduce execution), assuming the graph G fits in
memory. In fact, by launching |Q| threads in parallel (Map),
we can achieve a linear speedup of |Q|. Each thread
examines one root r ∈ Q and computes shortest-path distances
in G from r to construct and solve the Steiner tree instances
for different choices of the parameter λ. Then, all possible
solutions can be collected (Reduce), and the best one
chosen. To this end, each thread needs to evaluate its
candidate solutions. Since these are typically small in practice,
the thread can compute the induced shortest-path distances
from all vertices in its solutions and compute their Wiener
indices exactly. In the (unlikely) scenario that a solution is
large, we can instead approximate the Wiener index (see
Remark 1). This preserves the approximation guarantee while
providing an overall speedup of |Q|. If G is too large to
fit in memory, it becomes necessary to employ techniques 
for parallel and/or approximate shortest-distance computa¬ 
tions |52[ 1^ |40[ |45| , but these are beyond the scope of this 
work. 


7. CASE STUDIES 


Protein-Protein-Interaction network. Network analysis
has established itself as a central component of
computational and systems biology. Barabási et al. drew attention
to the great potential of "network medicine" in the study
of diseases. This work highlighted the utility of identifying
not only vertices with high betweenness centrality, but
also those that act as links between diseases. Finding such
vertices may lead to the discovery of new protein-disease
associations and a deeper understanding of the relationships
between diseases [Hl[T2l|47)[7]. 

The minimum Wiener connector fits well in this setting,
as it aims at finding few central vertices that connect a
given set of query vertices. As a proof of concept we use a
human Protein-Protein-Interaction (PPI) network collected
from BioGrid³ with 15 312 vertices. To demonstrate the
utility of ws-q we require a ground truth about the relationship,
so we select query proteins that have been the subject of
previous biological study. In Figure 6 we report the minimum
Wiener connector for a query set shown in grey, with the
solution connector-vertices in white. For each query node we
analyze the disease association of its next hop in the
connector, and find that it is indicative of the studied
association of the query node. For example, we observe that the
next hop of BMP1 is p53, which is widely regarded as central
in cancer; we then verify in the literature that BMP1 has in
fact also been linked to cancer [55, 51]. Similar
literature-verified examples are:

• PSEN is related to the other query nodes through 
GSK3B - uncovering its role in Alzheimer’s disease. 

• JAK2 is connected through HSP90 which has been 
studied for its potential therapeutic role in JAK-related 
diseases [6 

• SLC6A4 is suspected to play a role in Alzheimer's
disease, and is connected to SNCA, a known factor in
Alzheimer's.

Further, the high connectivity of the inner nodes suggests
a close relationship between cancer and Alzheimer's (e.g.,
as seen by the interaction between p53 and GSK3B), which
has in fact been a topic of interest and study [44].

This sample query is exemplary of the quality and poten¬ 
tial of finding the minimum Wiener connector. Identifying 
not only high betweenness and important nodes, but also 
those that act as links, gives potential for new directions 
of investigation for protein-disease and disease-disease rela¬ 
tionships. The connector also provides a concise summary 
of the relationships that is amenable to visualization. 

Social network. The next case study is based on a graph
created over Twitter users taking part in the ACM SIGKDD
2014 conference. The graph contains 1141 Twitter users
whose tweets over the three-day period contained the
hashtag #kdd2014, or who replied to or were mentioned in these
tweets. There is an edge between two users for each reply or
mention. The Clauset-Newman-Moore algorithm was used
to cluster the graph into 10 communities.⁴

³ http://thebiogrid.org/

Figure 5: Computational runtime of ws-q on different synthetic Power-Law (PL) and Erdős-Rényi (ER) graphs (first row)
and real-world graphs (second row), with varying query set size and fixed graph size (left column), and varying graph size
and fixed query set size (right column).

Figure 7: Two minimum Wiener connectors extracted from
the Twitter #kdd2014 graph.

Table 5: Statistics on Tweeters in #kdd2014 graph.

User ID           Followers   Notes
kdnuggets         23.1k       Top-1 mentioned in entire graph & G1
                              Top-1 betweenness in entire graph
                              Top-3 word in entire graph & G1
                              Top-6 mentioned in G2-G5
                              Top-10 word in G4
                              Top-2 replied-to in G5
drewconway        10.7k       Top-7 mentioned in entire graph & G1, G4
                              Top-4 replied-to in entire graph & G2, G3
                              Top-6 word in G4
gizmonaut         304         Top-9 tweeter in G10
irescuapp         204         Top-7 mentioned in G10
jromich           165         Top-7 replied-to in G1
francescobonchi   619         Top-7 mentioned in G2

Figure 7 reports two minimum Wiener connectors extracted
with query sets Q (shown in gray) consisting of
vertices belonging to different communities. The vertices
added to Q to produce the solution subgraph H are, in both
cases, users that exhibit some influence or leadership. In
particular, we observe that in both examples H contains the
users kdnuggets and drewconway, each of whom has a very large
set of Twitter followers (23.1k and 10k respectively), and who
turn out to be the top mentioned and replied-to users in the
whole #kdd2014 dataset. Table 5 contains more detailed
information. In particular it shows that the other intermediate
vertices included in the minimum Wiener connectors also exhibit
high levels of activity and are among the top-10 mentioned or
replied-to users in their respective communities.

^3 https://nodexlgraphgallery.org/Pages/Graph.aspx?graphID=26533


8. CONCLUSIONS 

In this paper we introduced the Min Wiener Connector
problem: given a graph and a set of query vertices, find
a subgraph that connects the query vertices while minimizing
the sum of pairwise shortest-path distances within that
subgraph. In this simple and elegant formulation, the objective
function favors small solutions built by adding important
(central) vertices to connect the given query vertices.
Thanks to these features, the minimum Wiener connector
lends itself naturally to applications in biological and social
network analysis.

We showed that the problem is NP-hard, admits no PTAS
unless P = NP, and has an exact (yet impractical) algorithm that
runs in polynomial time for the special case where the size
of the query set is constant. Also, as a major contribution,
we provided a constant-factor approximation algorithm that
runs in time proportional (up to logarithmic factors) to the
product of the size of the input graph and the number of query
vertices.
APPENDIX

A. REMAINING PROOFS 
A.1 Proof of Theorem (Section

Recall that a homomorphism between two graphs $H$
and $H'$ is a mapping $\phi : V(H) \to V(H')$ such that
$(u,v) \in E(H)$ implies $(\phi(u),\phi(v)) \in E(H')$.

Lemma 6. Let $\phi : H \to H'$ be a surjective graph
homomorphism. Then $W(H') \le W(H)$.

Proof. The existence of a homomorphism implies that
for every path $p$ in $H$, there is a corresponding path in
$H'$ (which may not be simple even if $p$ is). Therefore
$d_{H'}(\phi(u),\phi(v)) \le d_H(u,v)$, and

$$
\begin{aligned}
W(H) &= \sum_{u,v \in V(H)} d_H(u,v) \\
     &\ge \sum_{u,v \in V(H)} d_{H'}(\phi(u),\phi(v)) \\
     &\ge \sum_{u',v' \in \phi(V(H))} d_{H'}(u',v') \\
     &= \sum_{u',v' \in V(H')} d_{H'}(u',v') = W(H').
\end{aligned}
$$

The second inequality uses the fact that every pair $u',v'$
from the image of $V(H)$ under $\phi$ is counted at least once as
$d_{H'}(\phi(u),\phi(v))$ for some $u,v \in V(H)$. The last equality
between sums is by the surjectivity of $\phi$. □
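As an illustration of Lemma 6 (not part of the original proof), the following minimal Python sketch computes the Wiener index by breadth-first search and checks the inequality on one hand-picked example. The function names, the adjacency-dictionary representation, and the example graphs (a 5-vertex path folded onto a 3-vertex path) are assumptions made for this sketch.

```python
from collections import deque
from itertools import combinations

def bfs_dist(adj, s):
    """Hop distances from s in an unweighted graph given as {vertex: [neighbours]}."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def wiener_index(adj):
    """Sum of shortest-path distances over unordered vertex pairs (graph assumed connected)."""
    dist = {s: bfs_dist(adj, s) for s in adj}
    return sum(dist[u][v] for u, v in combinations(adj, 2))

def is_surjective_homomorphism(adj_h, adj_h2, phi):
    """True if phi maps every edge of H to an edge of H' and hits every vertex of H'."""
    edges_ok = all(phi[v] in adj_h2[phi[u]] for u in adj_h for v in adj_h[u])
    onto = set(phi.values()) == set(adj_h2)
    return edges_ok and onto

# Hypothetical example: H is the path 0-1-2-3-4, H' is the path a-b-c,
# and phi folds H back and forth onto H'.
H = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
H2 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
phi = {0: "a", 1: "b", 2: "c", 3: "b", 4: "c"}

assert is_surjective_homomorphism(H, H2, phi)
assert wiener_index(H2) <= wiener_index(H)
```

Here the sums are taken over unordered pairs; using ordered pairs would double both sides and leave the comparison unchanged.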

Lemma 7. Let $G$ be a graph, $H$ be a connected subgraph
of $G$, $Q \subseteq V(H)$ and let

$$A = Q \cup \{ v \in V(H) \mid \deg_H(v) > 2 \}.$$

We call $A$ the set of pivotal vertices of $H$ with respect to $Q$.

Call a path $p$ between two vertices of $A$ basic if the internal
vertices of $p$ are outside $A$; say that an unordered pair of
vertices $u,v \in A$ is neighbouring if $u,v \in V(H)$ and there
is a basic path from $u$ to $v$. Suppose we construct a graph $H'$
by including the vertices and edges of an arbitrary shortest
path in $G$ between each pair of neighbouring elements of $A$.
Then $Q \subseteq V(H')$ and either $W(H') = W(H)$, or $H$ is not
a minimum Wiener connector for $Q$.

Proof. It suffices to show that if $H$ is a minimum Wiener
connector for $Q$, then we can construct a surjective
homomorphism $\phi$ from $H$ to $H'$ whose restriction to $A$ is the
identity. Indeed, then clearly $Q \subseteq A \subseteq V(H')$, and
Lemma 6 would imply $W(H') \le W(H)$.

For each neighbouring pair $(u,v) \in A \times A$, there is a unique
basic path $p = p_0 \ldots p_t$ in $H$ between $u = p_0$ and $v = p_t$
(otherwise removing some path yields a smaller connector).
The graph $H'$ contains a path $q_0, \ldots, q_{t'}$ in $G$ (where
$t' \le t$); define $\phi(p_i) = q_{\min(i,t')}$ for $i \in \{0, \ldots, t\}$.
The map $\phi$ is well-defined because the internal vertices of all
these paths in $H$ are distinct, as their degree is 2. By
construction, $\phi$ is surjective on $V(H')$ and $\phi(u) = u$ for all
$u \in A$. Moreover, the image of every edge in the unique basic
path between a pair of neighbouring vertices is an edge of $H'$.
Since any edge of $H$ must belong to some basic path (or else we
could remove one of its endpoints while reducing the Wiener index
of $H$), it follows that $\phi$ is a homomorphism, as claimed. □

Lemma 8. Let $H$ be a connected graph and let $\mathcal{P} =
(\{s_i, t_i\})_{i \in [m]}$ denote a sequence of $m$ unordered pairs of
distinct vertices of $H$. Write $T = \bigcup_{i \in [m]} \{s_i, t_i\}$, and
call a sequence $p_1, \ldots, p_m$ of paths in $H$ valid if for all
$i \in [m]$, $p_i$ is a shortest path between $s_i$ and $t_i$. There is a
valid sequence $p_1, \ldots, p_m$ of paths in $H$ such that in the
subgraph $H' = \bigcup_{i \in [m]} p_i$ of $H$ there are at most $m(m-1)$
vertices with degree different from two, and $W(H') \le W(H)$.

Proof. Since $H$ is connected, there is some valid sequence
of paths; what we need to show is that there is one
with the degree property. We argue by induction on $m$.
When $m = 1$, all the internal vertices of any shortest path
between $s_1$ and $t_1$ have degree two, so we are done. Suppose
the lemma holds for $m - 1$ vertex pairs; let $p_1, \ldots, p_{m-1}$
be the paths in the conclusion of the lemma, and take an
arbitrary shortest path $p_m$ from $s_m$ to $t_m$. We claim that
we can replace the path $p_m$ with another path $p'_m$ such that,
for each $p_i$ with $i < m$, at most two vertices have a different
successor or predecessor in the two paths $p'_m$ and $p_i$. This is
easy to see because if $p_m$ meets $p_i$, leaves it and intersects
it for a second time, we can replace the part of $p_m$ between
the two intersections by a subpath of $p_i$.

Consequently, when we add the edges in path $p'_m$ to the
subgraph $\bigcup_{i \in [m-1]} p_i$, for each $i < m$ we increase the
degree of at most two vertices that belong to $p_i$. Therefore the
total number of vertices of degree larger than two is at most
$(m-1)(m-2) + 2(m-1) = m(m-1)$, as desired.

Finally, note that the above path-replacement procedure
cannot increase the Wiener index as $H'$ is a subgraph of $H$
that maintains shortest-path distances. □

Lemma 9. Let $k = |Q|$. For any graph $G$, there is an
optimal solution $H$ to Min Wiener Connector where at most
$k^4$ vertices have degree in $H$ larger than two.

Note that such $H$ is not necessarily an induced subgraph.

Proof. Let $H$ be an optimal solution. Consider a sequence
$\mathcal{P}$ containing all $m = \binom{k}{2}$ pairs of distinct query
nodes. By Lemma 8, there is a sequence $p_1, \ldots, p_m$ of shortest
paths in $H$ (one for each pair of query nodes) such that
in the graph $H'$ formed by the union of all these paths, there
are at most $m(m-1)$ vertices with degree different from two.
Clearly $H'$ is a connector for $Q$ (since it contains paths linking
each pair of query nodes). Since $W(H') \le W(H)$ and $H$
is optimal, it follows that $W(H') = W(H)$. Also, there cannot
be any non-query vertices of degree 1, otherwise we could
remove them from $H'$ and still obtain a connector of $Q$ with
smaller Wiener index. So the total number of vertices of
degree larger than 2 in $H'$ is at most $k + m(m-1) \le k^4$. □

Proof of the theorem. Let $k = |Q|$. We can loop over
all possible vertex subsets of size at most $k^4$; by Lemma 9, one
of them will be the set $X$ of vertices of degree larger than 2 in
the optimal solution $H^*$. Then $X \cup Q$ is the set of pivotal
vertices of $H^*$ with respect to $Q$; we can construct in polynomial
time a graph $H'$ as in Lemma 7, and we will have $W(H') \le
W(H^*)$, hence $W(H^*) = W(H')$.

Overall, the algorithm runs in time □ 
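For intuition only, here is a naive exponential-time baseline in Python. It is not the algorithm from the proof above: it simply enumerates every vertex superset of Q and keeps the connected induced subgraph with the smallest Wiener index. Restricting to induced subgraphs loses nothing in objective value, since adding edges among a fixed vertex set can only shorten distances. The function names and the toy graph are assumptions made for this sketch, and it is usable only on very small inputs.

```python
from collections import deque
from itertools import combinations

def wiener_of_induced(adj, S):
    """Wiener index of the induced subgraph G[S], or None if G[S] is disconnected."""
    S = set(S)
    total = 0
    for s in S:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in S and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(S):
            return None
        total += sum(dist.values())
    return total // 2  # every unordered pair was counted twice

def brute_force_connector(adj, Q):
    """Try all supersets of Q; return (best Wiener index, vertex set) or None."""
    Q = set(Q)
    others = [v for v in adj if v not in Q]
    best = None
    for extra in range(len(others) + 1):
        for added in combinations(others, extra):
            S = Q | set(added)
            w = wiener_of_induced(adj, S)
            if w is not None and (best is None or w < best[0]):
                best = (w, S)
    return best

# Hypothetical toy graph: query vertices a, b, c share a hub x and are
# also linked by the longer route a-y1-b-y2-c.
G = {
    "a": ["x", "y1"], "b": ["x", "y1", "y2"], "c": ["x", "y2"],
    "x": ["a", "b", "c"], "y1": ["a", "b"], "y2": ["b", "c"],
}
best_w, best_set = brute_force_connector(G, {"a", "b", "c"})
```

On this toy input the baseline selects the hub x rather than the two path vertices, matching the behaviour described for the objective: a small solution built around a central vertex.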


A.2 Proof of Lemma [T] (Section Q 

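Let $r^* = \arg\min_{r} \sum_{v} d_H(v, r)$. Observe that

$$
|V(H)| \sum_{v} d_H(v, r^*) = \sum_{w} \Big( \sum_{v} d_H(v, r^*) \Big)
\le \sum_{w} \Big( \sum_{v} d_H(v, w) \Big) = 2W(H),
$$

and

$$
\begin{aligned}
2W(H) &\le \sum_{w} \Big( \sum_{v} \big[ d_H(v, r^*) + d_H(r^*, w) \big] \Big) \\
      &= \sum_{v,w} d_H(v, r^*) + \sum_{v,w} d_H(r^*, w) \\
      &= 2 \sum_{v,w} d_H(v, r^*) = 2 |V(H)| \sum_{v} d_H(v, r^*),
\end{aligned}
$$

where we used the choice of $r^*$ in the first inequality and the
triangle inequality for $d_H$ in the second inequality. □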
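The two bounds above can be checked numerically. The following Python sketch (an illustration under assumed names, not code from the paper) computes the three quantities $|V(H)| \sum_v d_H(v,r^*)$, $2W(H)$ and $2|V(H)| \sum_v d_H(v,r^*)$ for a small graph given as an adjacency dictionary; the 4-vertex path used as input is a hypothetical example.

```python
from collections import deque

def bfs_dist(adj, s):
    """Hop distances from s in an unweighted connected graph."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def root_bounds(adj):
    """Return (|V| * sum_v d(v, r*), 2 W(H), 2 |V| * sum_v d(v, r*))."""
    dist = {s: bfs_dist(adj, s) for s in adj}
    n = len(adj)
    # Summing over ordered pairs (v, w) gives exactly 2 W(H).
    two_w = sum(dist[v][w] for v in adj for w in adj)
    # r* minimises the total distance to all other vertices.
    best = min(sum(dist[r][v] for v in adj) for r in adj)
    return n * best, two_w, 2 * n * best

# Hypothetical example: the path 0-1-2-3.
P4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
lower, two_w, upper = root_bounds(P4)
assert lower <= two_w <= upper
```

In other words, the sum of distances to the best root, scaled by the number of vertices, approximates the Wiener index within a factor of two.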

A.3 Proof of Lemma (Section 

Let $T_S$ be the shortest-path tree from $r$ to the elements
of $T$, determined by an array of distances $d_S[\,]$ and an array
of parent links $p_S[\,]$. Consider the algorithm below, which
traverses $T$ and performs a series of edge relaxations that
add additional vertices from $T_S$ in order to decrease
distances to the root $r$. The important invariant maintained
is that the edges $\{(p[v], v) \mid d[v] \ne \infty\}$ form a subtree of
$T \cup T_S$, with $d[v]$ an upper bound on the distance between
the root and $v$ in the tree.


Algorithm 2 AdjustDistances

Input: A graph $G = (V, E)$; a subtree $T$; a root node $r \in V(T)$;
and a BFS tree from $r$ with parent array $p_S[\,]$ and distance
array $d_S[\,]$.

Output: A tree.

1: Construct hash tables $d[\,]$, $p[\,]$, with default values $p[v] = \mathrm{nil}$
   and $d[v] = \infty$ for all $v \in V(G)$.
2: $d[r] \leftarrow 0$.
3: dfs($r$).
4: return the tree $T' = \{(p[v], v) \mid v \in V(G) \wedge p[v] \ne \mathrm{nil}\}$.


Algorithm 3 dfs

Input: A vertex $u$.

1: if $d[u] > (1 + \sqrt{2})\, d_S[u]$ then
2:     AddPath($u$)
3: end if
4: for each child $v$ of $u$ in $T$ do
5:     relax($u$, $v$)
6:     dfs($v$)
7:     relax($v$, $u$)
8: end for


Algorithm 4 AddPath

Input: A vertex $u$.

1: $v \leftarrow u$
2: while $d[v] > d_S[v]$ do
3:     relax($p_S[v]$, $v$)
4:     $v \leftarrow p[v]$
5: end while


Algorithm 5 Relax

Input: two adjacent vertices $u, v$.

1: if $d[v] > d[u] + 1$ then
2:     $d[v] \leftarrow d[u] + 1$
3:     $p[v] \leftarrow u$
4: end if


We need to show that AdjustDistances runs in time
$O(|V(T)|)$ and returns a tree $T'$ satisfying the following:

(a) $V(T') \supseteq V(T)$;

(b) $|V(T')| \le (1 + \sqrt{2})\, |V(T)|$;

(c) for all $v \in V(T')$, $d_{T'}(r,v) \le (1 + \sqrt{2})\, d_G(r,v)$;

(d) $\sum_{v \in V(T')} d_G(r,v) \le 2 \sum_{v \in V(T)} d_G(r,v)$.
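The routines AdjustDistances, dfs, AddPath and Relax can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: graphs are adjacency dictionaries, the input tree T is given by child lists, missing dictionary keys play the role of $d[v] = \infty$ and $p[v] = \mathrm{nil}$, and `add_path` first walks up the BFS parents and then relaxes the collected edges top-down (an assumption about how the loop in AddPath is meant to behave). All names and the example input are hypothetical.

```python
import math
from collections import deque

ALPHA = 1 + math.sqrt(2)

def bfs_tree(adj, r):
    """BFS from r: returns the distance map d_S and the parent map p_S."""
    ds, ps = {r: 0}, {r: None}
    queue = deque([r])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in ds:
                ds[v], ps[v] = ds[u] + 1, u
                queue.append(v)
    return ds, ps

def adjust_distances(children, r, ds, ps, alpha=ALPHA):
    """children: child lists of the input tree T rooted at r.
    Returns (d, p); the output tree T' has the edges (p[v], v) for v != r."""
    d, p = {r: 0}, {r: None}  # absent key: d = infinity, p = nil

    def relax(u, v):
        if d.get(v, math.inf) > d.get(u, math.inf) + 1:
            d[v], p[v] = d[u] + 1, u

    def add_path(u):
        # Walk up the BFS tree to the first vertex whose estimate is tight,
        # then relax the BFS edges top-down so that d[u] drops to ds[u].
        stack, v = [], u
        while d.get(v, math.inf) > ds[v]:
            stack.append(v)
            v = ps[v]
        while stack:
            v = stack.pop()
            relax(ps[v], v)

    def dfs(u):
        if d[u] > alpha * ds[u]:
            add_path(u)
        for v in children.get(u, ()):
            relax(u, v)
            dfs(v)
            relax(v, u)

    dfs(r)
    return d, p

# Hypothetical example: G is a 10-cycle, T is the path 0-1-...-8 rooted at 0.
n = 10
cycle = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
path_children = {i: [i + 1] for i in range(8)}
ds, ps = bfs_tree(cycle, 0)
d, p = adjust_distances(path_children, 0, ds, ps)
```

In this example vertex 8 is far from the root along T but close in G, so the shortcut through vertex 9 is added and the estimates of vertices 6, 7 and 8 are lowered by the relaxations performed while the traversal backtracks. The recursive `dfs` is kept for readability; a long path-like T would need an explicit stack to avoid Python's recursion limit.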


Property a) holds because every time we insert a new vertex
$u$ in the tree (that is, $p[u]$ becomes $\ne \mathrm{nil}$), it is never
removed again; and the call dfs($r$) visits all vertices of $T$.
Property c) holds because $d[u] = d_S[u]$ for all $u \in V(T') \setminus V(T)$,
and whenever $d[u] > (1 + \sqrt{2})\, d_S[u]$ for some $u \in V(T)$, we
add a path to achieve $d[u] = d_S[u]$.

Next we analyze the running time. For $d[\,]$ and $p[\,]$ we
use a resizable hash table with constant expected amortized
update/lookup time [16, 43]. When an element $v$ is not in
the table, we insert $p[v] \leftarrow \mathrm{nil}$ and $d[v] \leftarrow \infty$. This way
lines 1-2 of AdjustDistances take time $O(1)$. We also keep track of
which elements $v \in V(G)$ have been assigned values in the
table, so line 4 takes time $O(|V(T')|)$ rather than $O(|V(G)|)$.
The running time of dfs($r$) (excluding line 1, which is run
$O(|V(T)|)$ times) is proportional to the number of calls to
relax made by AddPath and the recursive calls to dfs. The
number of relaxations is $O(|V(T')|)$ because every edge of $T$
or $T_S$ is relaxed at most twice by dfs and at most once by
AddPath. Therefore the running time of dfs($r$) is $O(|V(T')|)$,
which is also $O(|V(T)|)$ assuming property b).

Now we show property b). As the algorithm executes, define
a potential function $\Phi$ to be the distance estimate of the
current vertex (for ease of notation we omit the dependence
of $\Phi$ on the current time). When a shortest path of length
$\ell = d_S[u]$ to the current vertex $u$ is added by AddPath($u$),
$\Phi = d[u] > \alpha \ell$, where $\alpha = 1 + \sqrt{2}$. Adding the path lowers
$d[u]$ to $\ell$, decreasing $\Phi$ by at least $(\alpha - 1)\ell$. Hence the
total length of the added paths is bounded by the sum of the
decrements to $\Phi$ during the course of the algorithm, divided
by $\alpha - 1$. Since $\Phi$ is initially 0 and always nonnegative,
the sum of the decreases is at most the sum of the increments.
The potential $\Phi$ increases only when the current
vertex changes from some vertex $u$ to a vertex $v$ after the
edge $(u, v)$ was relaxed, which ensures that $d[v] \le d[u] + 1$
and that $\Phi$ increases by at most 1. Since each edge is
traversed twice, the total of the increases to $\Phi$ during the course
of the algorithm is bounded by twice the number of edges
in $T$. This establishes that the total length of the added
paths is bounded by $2/(\alpha - 1) = \sqrt{2}$ times the total number
of edges of $T$. Thus, $|V(T')| \le (1 + \sqrt{2})\, |V(T)|$, showing b).

Only property d) remains to be shown. Define analogously 


a potential tk to be when the current vertex is n G 

V{T). Adding a shortest path of length I when d[u] > (1 -|- 
\/2)l lowers by at least £^(1 -|- \/2). The vertices added 
increase the sum of distances from r by at most (^ 2 ^) < 12 . 

Hence the total increase in sum of distances is bounded by
the sum of the decrements to the potential, divided by
$2(1 + \sqrt{2})$. The sum of the decreases is at most the sum of the
increases. The potential increases by at most $d_G(v)$ when
relaxing an edge $(u,v)$. Since each edge is traversed twice, the
total of the increases to the potential is bounded by
$2(1 + \sqrt{2}) \sum_{v \in V(T)} d_G(v)$.

Hence $\sum_{v \in V(T') \setminus V(T)} d_G(v) \le \sum_{v \in V(T)} d_G(v)$, which
implies d). □


A.4 Proof of Lemma (Section 

We need the following lemma. 

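Lemma 10. Let $x_0, y_0 \in \mathbb{R}^+$ and $\lambda = \sqrt{y_0 / x_0}$. Then, for all
$x, y \in \mathbb{R}^+$ it holds that

$$
\frac{xy}{x_0 y_0} \le \left( \frac{x\lambda + y/\lambda}{x_0\lambda + y_0/\lambda} \right)^2 .
$$

Proof. Our choice of $\lambda$ implies that $4 x_0 y_0 =
(x_0 \lambda + y_0/\lambda)^2$. Recall that, by the AM-GM inequality,
$\sqrt{ab} \le \frac{a+b}{2} \Rightarrow 4ab \le (a + b)^2$ for all $a, b \in \mathbb{R}^+$. Hence

$$
\frac{xy}{x_0 y_0} = \frac{4xy}{4 x_0 y_0}
= \frac{4 (x\lambda)(y/\lambda)}{(x_0\lambda + y_0/\lambda)^2}
\le \left( \frac{x\lambda + y/\lambda}{x_0\lambda + y_0/\lambda} \right)^2 . \qquad \Box
$$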
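As a quick sanity check of Lemma 10 (an illustration, not part of the proof), the Python sketch below evaluates both sides of the inequality with $\lambda = \sqrt{y_0/x_0}$, the value forced by the identity $4x_0y_0 = (x_0\lambda + y_0/\lambda)^2$. The function name and the sampled ranges are assumptions made for this sketch.

```python
import math
import random

def lemma10_sides(x0, y0, x, y):
    """Return (lhs, rhs) of the inequality xy/(x0*y0) <= ((x*lam + y/lam)/(x0*lam + y0/lam))**2."""
    lam = math.sqrt(y0 / x0)
    lhs = (x * y) / (x0 * y0)
    rhs = ((x * lam + y / lam) / (x0 * lam + y0 / lam)) ** 2
    return lhs, rhs

# Random positive inputs: the left side never exceeds the right side
# (up to a small floating-point tolerance).
rng = random.Random(0)
for _ in range(1000):
    x0, y0, x, y = (rng.uniform(0.1, 100.0) for _ in range(4))
    lhs, rhs = lemma10_sides(x0, y0, x, y)
    assert lhs <= rhs * (1 + 1e-9)

# Equality holds when y/x = y0/x0, i.e. when AM-GM is tight.
tight_lhs, tight_rhs = lemma10_sides(2.0, 8.0, 3.0, 12.0)
assert math.isclose(tight_lhs, tight_rhs)
```

The tight case shows why this particular $\lambda$ is the natural choice: it balances the two terms $x_0\lambda$ and $y_0/\lambda$ of the denominator.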


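Now we are ready to show the lemma. Let $A^*$ denote the
optimal solution to the problem with objective $A(\cdot, r)$, and set

$$\lambda = \sqrt{ \frac{\sum_{u \in A^*} d_G(r,u)}{|A^*|} } .$$

It is easy to see that $\lambda \in [1/\sqrt{2}, \sqrt{|V(G)|}]$, as all distances
$d_G(r,u)$ but one (i.e., $d_G(r,r)$, which is equal to 0) are in
the range $[1, |V(G)|]$ and $|A^*| \ge 2$. Let $B^*$ (resp., $B$) denote
the optimal solution (resp., an $\alpha$-approximate solution) to
the problem with objective $B(\cdot, r)$. By our choice of $\lambda$,
Lemma 10 implies that

$$
\begin{aligned}
\frac{A(B,r)}{A(A^*,r)}
&= \frac{|B| \sum_{u \in B} d_G(r,u)}{|A^*| \sum_{u \in A^*} d_G(r,u)} \\
&\le \left( \frac{|B|\lambda + \frac{1}{\lambda} \sum_{u \in B} d_G(r,u)}
                 {|A^*|\lambda + \frac{1}{\lambda} \sum_{u \in A^*} d_G(r,u)} \right)^2 \\
&= \left( \frac{B(B,r)}{B(A^*,r)} \right)^2
\le \left( \frac{B(B,r)}{B(B^*,r)} \right)^2 ,
\end{aligned}
$$

where the last inequality follows from $B(A^*,r) \ge B(B^*,r)$
(due to the optimality of $B^*$ for the problem with objective
$B(\cdot, r)$). To complete the proof, note that
$B(B,r)/B(B^*,r) \le \alpha$ holds by hypothesis. □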


A.5 Proof of Theorem (Section 

We have to show that an optimal solution to the integer
program (6) gives an optimal solution to our problem, and vice
versa.

It is clear that any solution $S \subseteq V(G)$ to Min Wiener
Connector yields a feasible solution to (6): set $y_u = 1$
and $p_{uv} = 1$ iff $u, v \in S$, and for each $s,t \in S$, pick a
shortest path $z_0 = s, z_1, \ldots, z_\ell = t$ from $s$ to $t$ in $S$ and set
$f^{st}_{z_i z_{i+1}} = 1$; set all other variables to 0. This satisfies all
constraints and the objective function coincides with $W(S)$.

Conversely, consider an optimal solution to (6) and let
$S = \{u \in V \mid y_u = 1\}$; note that $S \supseteq Q$. We show that
the objective function is at least $W(G[S])$. For any $s,t \in S$,
$p_{st} \ge y_s + y_t - 1 \ge 1$. The constraints now imply that we can
route $p_{st} \ge 1$ units of flow from $s$ to $t$, where the capacity
of directed edge $(u,v)$ is at most $y_u$. Note that once $y_u$
and $p_{st}$ have been fixed, all remaining constraints involve
only flow variables and constants, and only flow variables
with the same pair $s,t$. Therefore, the only way to minimize
the objective function is to find, for each $s, t$, a min-cost
flow of value $p_{st}$ (where the cost and capacity of every edge
is one), i.e., a shortest path from $s$ to $t$ in $S$. Thus there
is such a path $p$, and the sum of $f^{st}_{uv}$ for the directed edges
$(u,v)$ of $p$ is at least $d_S(s,t)$. As this holds for all $s,t$, the
result follows. □

























